Back in the day using ADO, we used GetRows() to pull back an array and loop through it, because it was faster than using rs.MoveNext to walk through records. I'm writing an application that pulls back half a million rows and writes them out into a file. Pulling the data from SQL takes about 3 minutes, but writing it to a CSV is taking another 12 minutes. From the looks of it, it's because I'm looping through a SqlDataReader. What is a faster alternative?
Keep in mind that I do not know what the SQL structure will look like, as this is calling a reporting table that tells my application which query should be run. I looked at using LINQ and returning an array, but that would require knowing the structure, so it will not work.
Note that in the code below the case statement has many cases; to save space, I removed all of them except one.
StringBuilder rowValue = new StringBuilder();
SqlDataReader reader = queryData.Execute(System.Data.CommandType.Text, sql, null);
//this is to handle multiple record sets
while (reader.HasRows)
{
    rowValue = new StringBuilder(); // start a fresh builder for this record set's header
    for (int i = 0; i < reader.FieldCount; i++)
    {
        if (rowValue.Length > 0)
            rowValue.Append("\",\"");
        else
            rowValue.Append("\"");
        rowValue.Append(reader.GetName(i).Replace("\"", "'").Trim());
    }
    rowValue.Append("\"" + Environment.NewLine);
    File.AppendAllText(sourceFile, rowValue.ToString());
    while (reader.Read())
    {
        rowValue = new StringBuilder();
        for (int i = 0; i < reader.FieldCount; i++)
        {
            String value = "";
            switch (reader.GetFieldType(i).Name.ToLower())
            {
                case "int16":
                    value = reader.IsDBNull(i) ? "NULL" : reader.GetInt16(i).ToString();
                    break;
            }
            if (rowValue.Length > 0)
                rowValue.Append("\",=\""); // separate items
            else
                rowValue.Append("\"");     // first item of the row
            rowValue.Append(value.Replace("\"", "'").Trim());
        }
        rowValue.Append("\"" + Environment.NewLine); // last item of the row
        File.AppendAllText(sourceFile, rowValue.ToString());
    }
    //next record set
    reader.NextResult();
    if (reader.HasRows)
        File.AppendAllText(sourceFile, Environment.NewLine);
}
reader.Close();
The problem here is almost certainly that you are calling File.AppendAllText() for every row. Since AppendAllText opens, writes, then closes the file every time it is called, it can get quite slow.
A better way is to open the file once and keep it open for the whole export, either with File.AppendText() (which returns a StreamWriter) or with an explicit StreamWriter.
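For example, a rough sketch of the writing side using a single StreamWriter (reusing the reader, rowValue and sourceFile names from the question; the row-building code is unchanged and only indicated by a comment):

// Open the output file once and stream every row through the same writer,
// letting it buffer instead of re-opening the file for each row.
using (var writer = new StreamWriter(sourceFile))   // or File.AppendText(sourceFile)
{
    while (reader.Read())
    {
        // build the row text into rowValue exactly as before ...
        writer.WriteLine(rowValue.ToString());      // replaces File.AppendAllText(...)
    }
}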
I am writing code that compares the roles in two DataGridViews. The first grid (DGV1) has the raw data with duplicate roles, and the second grid (DGV4) is a dictionary with all existing roles (no duplicates). The code has to go through each row of the dictionary and, if the role exists in DGV1, remove it from the dictionary, leaving only the roles that are not currently used in the raw data. My code does remove the roles, but when the dictionary has a value that doesn't exist in DGV1, it stops working (DGV1 keeps looping until it throws an index error). Any suggestions?
NOTE: When a row is removed, the remaining dictionary rows move up to the first index, so there is no need to increment i.
int eliminado = 0;
int filasDGV1 = dataGridView1.Rows.Count;
int filasDGV4 = dataGridView4.Rows.Count;
int i = 0;
int j = 0;
do
{
    string perfilVacio = dataGridView4["GRANTED_ROLE", i].Value.ToString();
    string perfiles = dataGridView1["GRANTED_ROLE", j].Value.ToString();
    if (perfiles != perfilVacio)
    {
        j++;
    }
    else if (perfiles == perfilVacio)
    {
        dataGridView4.Rows.RemoveAt(i);
    }
}
while (eliminado <= filasDGV4);
The first Excel screenshot is DGV1 and the other is DGV2; I highlighted where the code is currently looping.
The orange highlight is where the loop moves on in DGV1, but the value doesn't exist in the dictionary, so it gets stuck there.
Change your loop condition to include a test for the changing index j and also to check whether there are rows left to be eliminated.
int filasDGV1 = dataGridView1.Rows.Count;
int j = 0;
while (j < filasDGV1 && dataGridView4.Rows.Count > 0)
{
    string perfilVacio = dataGridView4["GRANTED_ROLE", 0].Value.ToString();
    string perfiles = dataGridView1["GRANTED_ROLE", j].Value.ToString();
    if (perfiles == perfilVacio)
    {
        dataGridView4.Rows.RemoveAt(0);
    }
    else
    {
        j++;
    }
}
If you test perfiles != perfilVacio in the if, you don't have to test perfiles == perfilVacio in an else if, because that is automatically the case: either they are equal or they are not. There is no other possibility.
Also, it is generally more readable to ask a positive question (==) in the if rather than a negative one (!=).
Since i is always 0, I replaced it with the constant 0. The variable eliminado is not required (unless it is incremented when rows are removed, to report the number of deleted rows).
The number of rows in dataGridView4 should not be stored in filasDGV4, because this number changes as rows are removed.
Update
According to your comments and the new screenshots, you need two loops. (The code above only works if both lists are sorted). We could use two nested loops; however, this is slow. Therefore, I suggest collecting the unwanted roles in a HashSet<string> first. Testing whether an item is in a HashSet is extremely fast. Then we can loop through the rows of the dictionary and delete the unwanted ones.
var unwanted = new HashSet<string>();
for (int i = 0; i < dataGridView1.Rows.Count; i++)
{
    unwanted.Add(dataGridView1["GRANTED_ROLE", i].Value.ToString());
}

int row = 0;
while (row < dataGridView4.Rows.Count)
{
    string perfilVacio = dataGridView4["GRANTED_ROLE", row].Value.ToString();
    if (unwanted.Contains(perfilVacio))
    {
        dataGridView4.Rows.RemoveAt(row);
    }
    else
    {
        row++;
    }
}
Suggestion: Using data binding to bind your DataGridViews to generic lists would enable you to work on these lists instead of working on the DGVs. This would simplify the data handling considerably.
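A minimal sketch of that data-binding idea, assuming the roles are wrapped in a small class; the RoleRow type and the allRoles/rawDataRoles lists below are illustrative names, not from the original code (BindingList lives in System.ComponentModel):

public class RoleRow
{
    public string GrantedRole { get; set; }
}

// Bind the dictionary grid to a list instead of filling it cell by cell.
var dictionaryRoles = new BindingList<RoleRow>(allRoles);   // allRoles: List<RoleRow>
dataGridView4.DataSource = dictionaryRoles;

// Filtering now works on the list; the bound grid updates automatically.
var used = new HashSet<string>(rawDataRoles.Select(r => r.GrantedRole));
for (int i = dictionaryRoles.Count - 1; i >= 0; i--)
{
    if (used.Contains(dictionaryRoles[i].GrantedRole))
        dictionaryRoles.RemoveAt(i);
}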
I have an API in C# that returns data from a DB and a frontend that paints that data in a table.
My approach was to read the data from the DB with a SqlDataReader, iterate through the reader adding each result to a list, and return that list to the frontend.
Seems easy enough, until I receive massive query results. My solution was to return the data chunk by chunk, but I'm stuck on it; this is the code I'm working with:
var sqlCommand = db.InitializeSqlCommand(query);
try
{
    using (var reader = sqlCommand.ExecuteReader())
    {
        var results = new List<List<string>>();
        var headers = new List<string>();
        var rows = new List<string>();
        for (var i = 0; i < reader.FieldCount; i++)
        {
            headers.Add(reader.GetName(i));
        }
        results.Add(headers);
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                rows.Add((reader[reader.GetName(i)]).ToString());
            }
            results.Add(rows);
            var str = JsonConvert.SerializeObject(results);
            var buffer = Encoding.UTF8.GetBytes(str);
            //Thread.Sleep(1000);
            await outputStream.WriteAsync(buffer, 0, buffer.Length);
            rows.Clear();
            results.Clear();
            outputStream.Flush();
        }
    }
}
catch (HttpException ex)
{
    if (ex.ErrorCode == -2147023667) // The remote host closed the connection.
    {
    }
}
finally
{
    outputStream.Close();
    db.Dispose();
}
With this, I'm able to return the data row by row (tested with the Thread.Sleep), but I'm stuck on how to return a specific amount at a time, say 200 rows or 1,000; it really should not matter.
Any idea on how to proceed?
Thanks in advance.
Mese.
I think controlling the query is the better way, since that is what will be fetched from the database. You can increase the OFFSET for every subsequent run. Example: after the ORDER BY clause, add OFFSET 200 ROWS FETCH NEXT 200 ROWS ONLY to skip 200 rows and get the next 200.
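A rough sketch of that approach; baseQuery, connection, and the page size are placeholders here, and the query is assumed to end in an ORDER BY clause, which OFFSET/FETCH requires:

const int pageSize = 200;
int offset = 0;
bool more = true;
while (more)
{
    string pagedSql = baseQuery + $" OFFSET {offset} ROWS FETCH NEXT {pageSize} ROWS ONLY";
    using (var cmd = new SqlCommand(pagedSql, connection))
    using (var reader = cmd.ExecuteReader())
    {
        int rowsRead = 0;
        while (reader.Read())
        {
            rowsRead++;
            // serialize and send this row to the client, as in the question
        }
        more = (rowsRead == pageSize);   // a short page means we reached the end
    }
    offset += pageSize;
}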
However, since you've mentioned that you have no control over the query, you can do something like this to filter the results on your end. The key trick here is to use reader.AsEnumerable.Skip(200).Take(200) to choose which rows to process, and update the input to Skip() in every iteration accordingly.
// The offset variable decides how many rows to skip; the outer while loop can be
// used to determine whether more data is present and to increment offset by 200 (or
// any other value as required). Offset -> 0, 200, 400, 600, ... while data is present.
bool hasMoreData = true;
int offset = 0;
while (hasMoreData)
{
    // SQL data reader and other important operations
    foreach (var row in reader.AsEnumerable.Skip(offset).Take(200))
    {
        // Processing operations
    }
    // Check to ensure there are more rows
    if (no more rows)
        hasMoreData = false;
    offset += 200;
}
Another thing to keep in mind: when you pull the data in batches, the query executes multiple times, and if a new record is added or deleted during that time, the batches will not line up correctly. To get past this, you can do two things:
Validate a unique ID of every record against the unique IDs of already-fetched records, to make sure the same record isn't pulled twice (an edge case due to record addition/deletion); a sketch of this check follows after the list below.
Add a buffer to your offset, such as
Skip(0).Take(100) // Pulls 0 - 100 records
Skip(90).Take(100) // Pulls 90 - 190 records (overlap of 10 to cater for additions/deletions)
Skip(180).Take(100) // Pulls 180 - 280 records (overlap of 10 to cater for additions/deletions)
and so on...
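A minimal sketch of that unique-ID check; the "Id" column name is hypothetical, and row stands for the current record of whichever batch is being processed:

var seenIds = new HashSet<string>();             // keep this alive across all batches

// ... inside the per-row processing of the current batch ...
string id = row["Id"].ToString();                // "Id" is a hypothetical unique column
if (!seenIds.Add(id))
{
    // Already processed in an earlier (overlapping) batch, so skip it.
}
else
{
    // Process the row.
}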
Hope this helps!
I have a really large table with about 1,000,000 rows of data in a C# DataTable, and I would like to upload it into a MySQL db table. What is the best and fastest way to do this?
Looping through the rows and uploading one row at a time looks really bad performance-wise, and it also throws a timeout exception at times.
I know that one of the solutions is to write it out to a file and read it from file using mysqlbulkloader. Is there any other way this could be done directly from the data table to the database ?
A non-generic solution exists in the form of building a SQL query using StringBuilder. I have used a solution like this for MSSQL 2008, so it may prove useful for you.
string _insertQuery(IEnumerable<Item> datatable) {
    var sb = new StringBuilder();
    sb.Append("INSERT INTO table (coltext, colnum, coltextmore) VALUES ");
    foreach (var i in datatable) {
        sb.AppendFormat("('{0}', {1}, '{2}'),",
            new object[] { i.ColText, i.ColNum, i.ColTextMore });
    }
    sb.Remove(sb.Length - 1, 1); // drop the trailing comma
    return sb.ToString();
}
And you will (probably) need a way to page through the 1,000,000 rows:
var lst = new List<Item>();
// ...
for (int i = 0; i < lst.Count; i += 1000) {
    _insertQuery(lst.RangeOf(i, Math.Min(1000, lst.Count - i)));
}
RangeOf() is an IList extension I wrote that pages through the list:
public static IList<T> RangeOf<T>(this IList<T> src, int start, int length) {
    var result = new List<T>();
    for (int i = start; i < start + length; i++) {
        result.Add(src[i]);
    }
    return result;
}
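To actually send each generated batch to MySQL, a sketch along these lines could sit around the paging loop, assuming the MySql.Data connector (MySqlConnection/MySqlCommand) and treating connectionString as a placeholder:

using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    for (int i = 0; i < lst.Count; i += 1000)
    {
        int length = Math.Min(1000, lst.Count - i);     // don't run past the end
        string sql = _insertQuery(lst.RangeOf(i, length));
        using (var cmd = new MySqlCommand(sql, conn))
        {
            cmd.ExecuteNonQuery();                      // one multi-row INSERT per batch
        }
    }
}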
Iterating through a DataTable that contains about 40,000 records using a for loop takes almost 4 minutes. Inside the loop I'm just reading the value of a specific column of each row and concatenating it to a string.
I'm not opening any DB connections or anything; it's a function that receives a DataTable, iterates through it, and returns a string.
Is there any faster way of doing this?
Code goes here:
private string getListOfFileNames(DataTable listWithFileNames)
{
    string whereClause = "";
    if (listWithFileNames.Columns.Contains("Filename"))
    {
        whereClause = "where filename in (";
        for (int j = 0; j < listWithFileNames.Rows.Count; j++)
            whereClause += " '" + listWithFileNames.Rows[j]["Filename"].ToString() + "',";
    }
    whereClause = whereClause.Remove(whereClause.Length - 1, 1);
    whereClause += ")";
    return whereClause;
}
Are you using a StringBuilder to concatenate the strings rather than regular string concatenation?
Are you pulling back any more columns from the database than you really need? If so, try not to. Only pull back the column(s) that you need.
Are you pulling back any more rows from the database than you really need? If so, try not to. Only pull back the row(s) that you need.
How much memory does the computer have? Is it maxing out when you run the program, or getting close to it? Is the processor near its max? If you're using too much memory then you may need to do more streaming. This means not pulling the whole result set into memory (i.e. a DataTable) but reading each line one at a time. It also might mean that rather than concatenating the results into a string (or StringBuilder), you might need to append them to a file so as not to take up so much memory.
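As a minimal sketch of the StringBuilder point applied to the method from the question:

private string getListOfFileNames(DataTable listWithFileNames)
{
    var whereClause = new StringBuilder();
    if (listWithFileNames.Columns.Contains("Filename"))
    {
        whereClause.Append("where filename in (");
        for (int j = 0; j < listWithFileNames.Rows.Count; j++)
        {
            if (j > 0)
                whereClause.Append(",");
            whereClause.Append("'").Append(listWithFileNames.Rows[j]["Filename"]).Append("'");
        }
        whereClause.Append(")");
    }
    return whereClause.ToString();
}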
The following LINQ statement has a where clause on the first column and concatenates the third column into a variable.
string CSVValues = String.Join(",", dtOutput.AsEnumerable()
    .Where(a => a[0].ToString() == value)
    .Select(b => b[2].ToString()));
Step 1 - run it through a profiler, make sure you're looking at the right thing when optimizing.
Case in point: we had an issue we were sure was caused by slow database interactions, and when we ran the profiler the db barely showed up.
That said, possible things to try:
if you have the memory available, convert the query to a list; this will force a full db read. Otherwise the linq will probably load in chunks, doing multiple db queries.
push the work to the db - if you can create a query that trims down the data you are looking at, or even calculates the string for you, that might be faster.
if this is something where the query is run often but the data rarely changes, consider copying the data to a local db (e.g. sqlite) if you're using a remote db.
if you're using the local sql-server, try sqlite, it's faster for many things.
var value = dataTable
    .AsEnumerable()
    .Select(row => row.Field<string>("columnName"));
var colValueStr = string.Join(",", value.ToArray());
Try adding a dummy column in your table with an expression. Something like this:
DataColumn dynColumn = new DataColumn();
{
    dynColumn.ColumnName = "FullName";
    dynColumn.DataType = System.Type.GetType("System.String");
    dynColumn.Expression = "LastName + ' ' + FirstName"; // example: concatenate two columns into one string
}
UserDataSet.Tables[0].Columns.Add(dynColumn);
Later in your code you can use this dummy column instead. You don't need to run any loop to concatenate the strings.
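Reading the computed value later is then just a column lookup, for example:

// Hypothetical usage: the expression column behaves like any other column of the table.
string fullName = UserDataSet.Tables[0].Rows[0]["FullName"].ToString();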
Try using a parallel loop.
Here's some sample code (the values are collected in a thread-safe collection, because concatenating to a shared string inside Parallel.ForEach is not thread-safe):
var parts = new ConcurrentBag<string>(); // System.Collections.Concurrent
Parallel.ForEach(dataTable.AsEnumerable(),
    row => parts.Add(row["ColumnName"].ToString()));
string str = string.Concat(parts); // note: ordering is not preserved
I've separated the job into small pieces and let each piece be handled by its own thread. You can fine-tune the number of threads by varying nthreads. Try different values so you can see the difference in performance.
private string getListOfFileNames(DataTable listWithFileNames)
{
    string whereClause = String.Empty;
    if (listWithFileNames.Columns.Contains("Filename"))
    {
        int nthreads = 8; // You can play with this parameter to fine tune and get your best time.
        int load = listWithFileNames.Rows.Count / nthreads; // This tells how many items each thread must process.
        List<ManualResetEvent> mres = new List<ManualResetEvent>(); // These let the method know when the work is done.
        List<StringBuilder> sbuilders = new List<StringBuilder>(); // These are used to build each partial string.
        for (int i = 0; i < nthreads; i++)
        {
            sbuilders.Add(new StringBuilder()); // Create a new string builder.
            mres.Add(new ManualResetEvent(false)); // Create a non-signaled ManualResetEvent.
            if (i == 0) // We know where to put the very beginning of your where clause.
            {
                sbuilders[0].Append("where filename in (");
            }
            // Calculate the last item to be processed by the current thread.
            int end = i == (nthreads - 1) ? listWithFileNames.Rows.Count : i * load + load;
            // Create a new thread to deal with a part of the big table.
            Thread t = new Thread(new ParameterizedThreadStart((x) =>
            {
                // This is the inside of the thread; we must unbox the parameters.
                object[] vars = x as object[];
                int lIndex = (int)vars[0];
                int uIndex = (int)vars[1];
                ManualResetEvent ev = vars[2] as ManualResetEvent;
                StringBuilder sb = vars[3] as StringBuilder;
                bool coma = false;
                // Concatenate the rows in the string builder.
                for (int j = lIndex; j < uIndex; j++)
                {
                    if (coma)
                    {
                        sb.Append(", ");
                    }
                    else
                    {
                        coma = true;
                    }
                    sb.Append("'").Append(listWithFileNames.Rows[j]["Filename"]).Append("'");
                }
                // Tell the parent thread that this job is done.
                ev.Set();
            }));
            // Start the thread with the calculated parameters.
            t.Start(new object[] { i * load, end, mres[i], sbuilders[i] });
        }
        // Wait for all child threads to finish their job.
        WaitHandle.WaitAll(mres.ToArray());
        // Concatenate the big string.
        for (int i = 1; i < nthreads; i++)
        {
            sbuilders[0].Append(", ").Append(sbuilders[i]);
        }
        sbuilders[0].Append(")"); // Close your where clause.
        // Return the finished where clause.
        return sbuilders[0].ToString();
    }
    // The "Filename" column was not found; return the empty clause.
    return whereClause;
}
I'm iterating through a smallish (~10GB) table with a foreach / IQueryable and LINQ-to-SQL.
Looks something like this:
using (var conn = new DbEntities() { CommandTimeout = 600*100 })
{
    var dtable = conn.DailyResults.Where(dr => dr.DailyTransactionTypeID == 1);
    foreach (var dailyResult in dtable)
    {
        //Math here, results stored in-memory, but this table is very small.
        //At the very least compared to stuff I already have in memory. :)
    }
}
The Visual Studio debugger throws an out-of-memory exception after a short while at the base of the foreach loop. I'm assuming that the rows of dtable are not being flushed. What to do?
The IQueryable<DailyResult> dtable will attempt to load the entire query result into memory when enumerated... before any iterations of the foreach loop. It does not load rows one at a time as the foreach loop iterates. If you want that behavior, use a DataReader.
You call ~10GB smallish? you have a nice sense of humor!
You might consider loading rows in chunks, aka pagination.
conn.DailyResults.Where(dr => dr.DailyTransactionTypeID == 1).Skip(x).Take(y);
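A rough sketch of wrapping that in a loop; the page size, the OrderBy key ("Id" is hypothetical), and materializing each page with ToList() are assumptions here:

const int pageSize = 10000;
for (int page = 0; ; page++)
{
    var batch = conn.DailyResults
        .Where(dr => dr.DailyTransactionTypeID == 1)
        .OrderBy(dr => dr.Id)              // a stable ordering is required for paging
        .Skip(page * pageSize)
        .Take(pageSize)
        .ToList();

    foreach (var dailyResult in batch)
    {
        // math here, as in the original loop
    }

    if (batch.Count < pageSize)
        break;                             // last page reached
}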
Using DataReader is a step backward unless there is a way to use it within LINQ. I thought we were trying to get away from ADO.
The solution suggested above works, but it's truly ugly. Here is my code:
int iTake = 40000;
int iSkip = 0;
int iLoop;
ent.CommandTimeout = 6000;
while (true)
{
    iLoop = 0;
    IQueryable<viewClaimsBInfo> iInfo = (from q in ent.viewClaimsBInfo
                                         where q.WorkDate >= dtStart &&
                                               q.WorkDate <= dtEnd
                                         orderby q.WorkDate
                                         select q)
                                        .Skip(iSkip).Take(iTake);
    foreach (viewClaimsBInfo qInfo in iInfo)
    {
        iLoop++;
        if (lstClerk.Contains(qInfo.Clerk.Substring(0, 3)))
        {
            /// Various processing....
        }
    }
    if (iLoop < iTake)
        break;
    iSkip += iTake;
}
You can see that I have to check for having run out of records because the foreach loop will end at 40,000 records. Not good.
Updated 6/10/2011: Even this does not work. At 2,000,000 records or so, I get an out-of-memory exception. It is also excruciatingly slow. When I modified it to use OleDB, it ran in about 15 seconds (as opposed to 10+ minutes) and didn't run out of memory. Does anyone have a LINQ solution that works and runs quickly?
Use .AsNoTracking() - it tells DbEntities not to cache (track) the retrieved rows.
using (var conn = new DbEntities() { CommandTimeout = 600*100 })
{
    var dtable = conn.DailyResults
                     .AsNoTracking()      // <<<<<<<<<<<<<<
                     .Where(dr => dr.DailyTransactionTypeID == 1);
    foreach (var dailyResult in dtable)
    {
        //Math here, results stored in-memory, but this table is very small.
        //At the very least compared to stuff I already have in memory. :)
    }
}
I would suggest using SQL instead to modify this data.