C#: Iterating through a huge datatable - c#

Iterating through a DataTable that contains about 40,000 records using a for loop takes almost 4 minutes. Inside the loop I'm just reading the value of a specific column of each row and concatenating it to a string.
I'm not opening any DB connections or anything; it's a function which receives a DataTable, iterates through it and returns a string.
Is there any faster way of doing this?
Here is the code:
private string getListOfFileNames(DataTable listWithFileNames)
{
string whereClause = "";
if (listWithFileNames.Columns.Contains("Filename"))
{
whereClause = "where filename in (";
for (int j = 0; j < listWithFileNames.Rows.Count; j++)
whereClause += " '" + listWithFileNames.Rows[j]["Filename"].ToString() + "',";
}
whereClause = whereClause.Remove(whereClause.Length - 1, 1);
whereClause += ")";
return whereClause;
}

Are you using a StringBuilder to concatenate the strings rather than just regular string concatenation?
Are you pulling back any more columns from the database than you really need to? If so, try not to. Only pull back the column(s) that you need.
Are you pulling back any more rows from the database than you really need to? If so, try not to. Only pull back the row(s) that you need.
How much memory does the computer have? Is it maxing out when you run the program, or getting close to it? Is the processor maxed out much, or at all? If you're using too much memory then you may need to do more streaming. This means not pulling the whole result set into memory (i.e. a DataTable) but reading each row one at a time. It also might mean that, rather than concatenating the results into a string (or StringBuilder), you append them to a file so as not to take up so much memory.
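As a rough illustration of the StringBuilder suggestion, here is a minimal sketch of the question's method rewritten to append into a single buffer. The class and method names (WhereClauseBuilder, Build) are made up for this example, and it assumes the same "Filename" column and the same output format as the original code.

```csharp
using System;
using System.Data;
using System.Text;

public static class WhereClauseBuilder
{
    // Hypothetical helper (not from the original post): builds the same
    // "where filename in ( 'a', 'b')" text as the question's loop,
    // but appends into one StringBuilder instead of re-copying a string.
    public static string Build(DataTable table)
    {
        // Nothing to build if the column is missing or there are no rows.
        if (!table.Columns.Contains("Filename") || table.Rows.Count == 0)
            return string.Empty;

        var sb = new StringBuilder("where filename in (");
        foreach (DataRow row in table.Rows)
        {
            sb.Append(" '").Append(row["Filename"]).Append("',");
        }
        sb.Length--;      // drop the trailing comma
        sb.Append(')');
        return sb.ToString();
    }
}
```

Unlike the original, this sketch returns an empty string when the column is missing or the table is empty, instead of calling Remove on an empty string.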

The following LINQ statement filters on the first column with a where clause and joins the values of the third column into a single string.
string CSVValues = String.Join(",", dtOutput.AsEnumerable()
.Where(a => a[0].ToString() == value)
.Select(b => b[2].ToString()));

Step 1: run it through a profiler, to make sure you're looking at the right thing when optimizing.
Case in point: we had an issue we were sure was slow database interactions, and when we ran the profiler the db barely showed up.
That said, possible things to try:
- If you have the memory available, convert the query to a list; this will force a full db read. Otherwise the LINQ will probably load in chunks, doing multiple db queries.
- Push the work to the db: if you can create a query that trims down the data you are looking at, or even calculates the string for you, that might be faster.
- If this is something where the query is run often but the data rarely changes, consider copying the data to a local db (e.g. SQLite) if you're using a remote db.
- If you're using the local SQL Server, try SQLite; it's faster for many things.

var value = dataTable
.AsEnumerable()
.Select(row => row.Field<string>("columnName"));
var colValueStr = string.Join(",", value.ToArray());

Try adding a dummy column to your table with an expression. Something like this:
DataColumn dynColumn = new DataColumn();
{
dynColumn.ColumnName = "FullName";
dynColumn.DataType = System.Type.GetType("System.String");
dynColumn.Expression = "LastName+' '-ABC";
}
UserDataSet.Tables[0].Columns.Add(dynColumn);
Later in your code you can use this dummy column instead. You don't need to run a loop to concatenate a string.

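Try using a parallel for-each loop.
Here's the sample code:
// Note: += on a shared string is not thread-safe, and the order of the rows is not preserved.
Parallel.ForEach(dataTable.AsEnumerable(),
item => { str += ((item as DataRow)["ColumnName"]).ToString(); });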

I've split the job into small pieces and let each piece be handled by its own thread. You can fine-tune the number of threads by varying the nthreads value. Try it with different numbers so you can see the difference in performance.
private string getListOfFileNames(DataTable listWithFileNames)
{
string whereClause = String.Empty;
if (listWithFileNames.Columns.Contains("Filename"))
{
int nthreads = 8; // You can play with this parameter to fine-tune and get your best time.
int load = listWithFileNames.Rows.Count / nthreads; // This tells how many items each thread must process.
List<ManualResetEvent> mres = new List<ManualResetEvent>(); // These help the method know when the work is done.
List<StringBuilder> sbuilders = new List<StringBuilder>(); // These are used to concatenate each partial string.
for (int i = 0; i < nthreads; i++)
{
sbuilders.Add(new StringBuilder()); // Create a new string builder
mres.Add(new ManualResetEvent(false)); // Create a non-signaled ManualResetEvent.
if (i == 0) // We know where to put the very beginning of your where clause
{
sbuilders[0].Append("where filename in (");
}
// Calculate the last item to be processed by the current thread
int end = i == (nthreads - 1) ? listWithFileNames.Rows.Count : i * load + load;
// Create a new thread to deal with a part of the big table.
Thread t = new Thread(new ParameterizedThreadStart((x) =>
{
// This is the inside of the thread, we must unbox the parameters
object[] vars = x as object[];
int lIndex = (int)vars[0];
int uIndex = (int)vars[1];
ManualResetEvent ev = vars[2] as ManualResetEvent;
StringBuilder sb = vars[3] as StringBuilder;
bool coma = false;
// Concatenate the rows in the string builder
for (int j = lIndex; j < uIndex; j++)
{
if (coma)
{
sb.Append(", ");
}
else
{
coma = true;
}
sb.Append("'").Append(listWithFileNames.Rows[j]["Filename"]).Append("'");
}
// Tell the parent Thread that your job is done.
ev.Set();
}));
// Start the thread with the calculated params
t.Start(new object[] { i * load, end, mres[i], sbuilders[i] });
}
// Wait for all child threads to finish their job
WaitHandle.WaitAll(mres.ToArray());
// Concatenate the big string.
for (int i = 1; i < nthreads; i++)
{
sbuilders[0].Append(", ").Append(sbuilders[i]);
}
sbuilders[0].Append(")"); // Close your where clause
// Return the finished where clause
return sbuilders[0].ToString();
}
// Returns empty
return whereClause;
}

Related

How can I pass an unknown number of arguments to C#'s "Database.Open("DatabaseName").Query()" method?

I have tried Googling to find a solution, but all I get is totally irrelevant results or results involving 2 dimensional arrays, like the one here: http://social.msdn.microsoft.com/Forums/vstudio/en-US/bb4d54d3-14d7-49e9-b721-db4501db62c8/how-does-one-increment-a-value-in-a-two-dimensional-array, which does not apply.
Say I have this declared:
var db = Database.Open("Content");
var searchTerms = searchText.Split('"').Select((element, index) => index % 2 == 0 ? element.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries) : new string[] { element }).SelectMany(element => element).ToList();
int termCount = searchTerms.Count;
(Note: All that you really need to know about searchTerms is that it holds a number of search terms typed into a search bar by the user. All the LINQ expression is doing is ensuring that text wrapped in quotes is treated as a single search term. It is not really necessary to know all of this for the purpose of this question.)
Then I have compiled (using for loops that loop for each number of items in the searchTerms list) a string to be used as a SELECT SQL query.
Here is an example that shows part of this string being compiled with the @0, @1, etc. placeholders so that my query is parameterized.
searchQueryString = "SELECT NULL AS ObjectID, page AS location, 'pageSettings' AS type, page AS value, 'pageName' AS contentType, ";
for (int i=0; i<termCount; i++)
{
if(i != 0)
{
searchQueryString += "+ ";
}
searchQueryString += "((len(page) - len(replace(UPPER(page), UPPER(@" + i + "), ''))) / len(@" + i + ")) ";
}
searchQueryString += "AS occurences ";
(Note: All that you really need to know about the above code is that I am concatenating the incrementing value of i to the @ symbol to dynamically compile the placeholder value.)
All of the above works fine, but later, I must use something along the lines of this (only I don't know how many arguments I will need until runtime):
foreach (var row in db.Query(searchQueryString, searchTerms[0]))
{
@row.occurences
}
(For clarification: I will need a number of additional arguments (i.e., in addition to the searchQueryString argument) equal to the number of items in the searchTerms list, AND they will have to reference the correct index, effectively referencing each index from lowest to highest, in order, separated by commas, of course.)
Also, I will, of course need to use an incrementing value to reference the appropriate index of the list, if I can even get that far, and I don't know how to do that either. Could I use i++ somehow for that?
I know C# is powerful, but maybe I am asking too much?
Use the params keyword for a variable number of parameters. With params, the arguments passed to a function are collected by the compiler into a temporary array.
static int AddParameters(params int[] values)
{
int total = 0;
foreach (int value in values)
{
total += value;
}
return total;
}
and can be called as
int add1 = AddParameters(1);
int add2 = AddParameters(1, 2);
int add3 = AddParameters(1, 2, 3);
int add4 = AddParameters(1, 2, 3, 4);
//-----------Edited Reply based on comments below---
You can use something like this to be used with SQL
void MYSQLInteractionFunction(String myConnectionString)
{
String searchQueryString = "SELECT NULL AS ObjectID, page AS location, 'pageSettings' AS type, page AS value, 'pageName' AS contentType, ";
// Collects the values that are passed on to UnlimitedParameters below.
List<string> myStringList = new List<string>();
SqlConnection myConnection = new SqlConnection(myConnectionString);
SqlCommand myCommand = new SqlCommand(searchQueryString, myConnection);
myConnection.Open();
SqlDataReader queryCommandReader = myCommand.ExecuteReader();
// Create a DataTable object to hold all the data returned by the query.
DataTable dataTable = new DataTable();
// Use the DataTable.Load(SqlDataReader) function to put the results of the query into a DataTable.
dataTable.Load(queryCommandReader);
Int32 rowID = 0; // or iterate on your Rows - depending on what you want
// Read every column of the chosen row.
foreach (DataColumn column in dataTable.Columns)
{
myStringList.Add(dataTable.Rows[rowID][column.ColumnName] + " | ");
}
myConnection.Close();
String[] myStringArray = myStringList.ToArray();
UnlimitedParameters(myStringArray);
}
static void UnlimitedParameters(params string[] values)
{
foreach (string strValue in values)
{
// Do whatever you want to do with this strValue
}
}
I'm not sure I quite understand what you need from the question, but it looks like you're substituting a series of placeholders in the SQL with another value. If that's the case, you can use String.Format to replace the values like this:
object val = "a";
object anotherVal = 2.0;
var result = string.Format("{0} - {1}", new[] { val, anotherVal });
This way, you can substitute as many values as you need by simply creating the arguments array to be the right size.
If you're creating a SQL query on the fly, then you need to be wary of SQL injection, and substituting user-supplied text directly into a query is a bit of a no-no from this point of view. The best way to avoid this is to use parameters in the query, which automatically then get sanitised to prevent SQL injection. You can still use a 'params array' argument though, to achieve what you need, for example:
public IDataReader ExecuteQuery(string sqlQuery, params string[] searchTerms)
{
var cmd = new SqlCommand { CommandText = sqlQuery };
for (int i = 0; i < searchTerms.Length; i++)
{
cmd.Parameters.AddWithValue(i.ToString(), searchTerms[i]);
}
return cmd.ExecuteReader();
}
Obviously, you could also build up the SQL string within this method if you needed to, based on the length of the searchTerms array.
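Building the placeholder part of the SQL from the number of search terms could look roughly like the sketch below. The helper name (PlaceholderSql.Placeholders) is invented for this example, and the "@0, @1, ..." naming is an assumption chosen to match numbered parameters; adapt it to whatever parameter names your command actually uses.

```csharp
using System;
using System.Linq;

public static class PlaceholderSql
{
    // Hypothetical helper: returns "@0, @1, @2" for count == 3,
    // so the SQL text contains only placeholders, never user input.
    public static string Placeholders(int count)
    {
        return string.Join(", ", Enumerable.Range(0, count).Select(i => "@" + i));
    }
}
```

For example, "WHERE page IN (" + PlaceholderSql.Placeholders(searchTerms.Length) + ")" yields one placeholder per term, and the values themselves are still supplied separately as parameters.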
Well, for how complex the question probably seemed, the answer ended up being something pretty simple. I suppose it was easy to overlook because I had never done this before and others may have thought it too obvious to be what I was looking for. However, I had NEVER tried to pass a variable length of parameters before and had no clue if it was even possible or not (okay, well I guess I knew it was possible somehow, but could have been very far from my method for all I knew).
In any case, I had tried:
foreach (var row in db.Query(searchQueryString, searchTerms)) //searchTerms is a list of strings.
{
//Do something for each row returned from the sql query.
}
I assumed that if it could handle a variable number of arguments (remember, each argument passed after the first in the Database.Query() method is treated as a filler for the placeholders in the query string, e.g., @0, @1, @2, etc.), it could accept them from a list if it could from an array.
I couldn't really have been any more wrong with that assumption, as passing a list throws an error. However, I was surprised when I finally broke down, converted the list to an array, and tried passing the array instead of the list.
Indeed, and here is the short answer to my original question, if I simply give it an array it works easily (a little too easily, which I suppose is why I was so sure it wouldn't work):
string[] searchTermsArray = searchTerms.ToArray();
foreach (var row in db.Query(searchQueryString, searchTermsArray)) //searchTermsArray is an array of strings.
{
//Do something for each row returned from the sql query.
}
The above snippet of code is really all that is needed to successfully answer my original question.
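The difference between the list and the array is most likely down to how C# binds a params parameter. The sketch below uses a stand-in method (ParamsDemo.CountArgs, invented for this example) and assumes that Query is declared with a trailing "params object[]" parameter: a List<string> is then wrapped as one single argument, while a string[] is accepted as the parameter array itself.

```csharp
using System;
using System.Collections.Generic;

public static class ParamsDemo
{
    // Stand-in with the assumed shape (string, params object[]);
    // it just reports how many arguments the method actually received.
    public static int CountArgs(string sql, params object[] args)
    {
        return args.Length;
    }

    // Passing the list: the compiler wraps it, so args == { list }.
    public static int FromList(List<string> terms)
    {
        return CountArgs("q", terms);
    }

    // Passing an array: string[] converts to object[], so each element
    // becomes its own argument.
    public static int FromArray(List<string> terms)
    {
        return CountArgs("q", terms.ToArray());
    }
}
```

With a list of three terms, FromList reports one argument and FromArray reports three, which matches the behavior described above.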

Alternative to Recordset Looping

Back in the day using ADO, we used GetRows() to pull back an array and loop through it, because it was faster than using rs.MoveNext to walk through records. I'm writing an application that pulls back half a million rows and writes them out into a file. Pulling the data from SQL takes about 3 minutes, but writing it to a CSV is taking another 12 minutes. From the looks of it, it's because I'm looping through a SqlDataReader. What is a faster alternative?
Keep in mind, I do not know what the SQL Structure will look like as this is calling a reporting table that tells my application what query should be called. I looked at using linq and return an array, but that will require knowing the structure, so that will not work.
Note the below code, case statement has many cases, but to cut down on space, I removed them all, except one.
StringBuilder rowValue = new StringBuilder();
SqlDataReader reader = queryData.Execute(System.Data.CommandType.Text, sql, null);
//this is to handle multiple record sets
while (reader.HasRows)
{
for (int i = 0; i < reader.FieldCount; i++)
{
if (rowValue.Length > 0)
rowValue.Append("\",\"");
else
rowValue.Append("\"");
rowValue.Append(reader.GetName(i).Replace("\"", "'").Trim());
}
rowValue.Append("\"" + Environment.NewLine);
File.AppendAllText(soureFile, rowValue.ToString());
while (reader.Read())
{
rowValue = new StringBuilder();
for (int i = 0; i < reader.FieldCount; i++)
{
String value = "";
switch (reader.GetFieldType(i).Name.ToLower())
{
case "int16":
value = reader.IsDBNull(i) ? "NULL" : reader.GetInt16(i).ToString();
break;
}
if (rowValue.Length > 0)
rowValue.Append("\",=\""); //separate items
else
rowValue.Append("\""); //first item of the row.
rowValue.Append(value.Replace("\"", "'").Trim());
}
rowValue.Append("\"" + Environment.NewLine); //last item of the row.
File.AppendAllText(soureFile, rowValue.ToString());
}
//next record set
reader.NextResult();
if (reader.HasRows)
File.AppendAllText(soureFile, Environment.NewLine);
}
reader.Close();
The problem here is almost certainly that you are calling File.AppendAllText() for every row. Since AppendAllText opens, writes, then closes the file every time it is called, it can get quite slow.
A better way would be either to use the AppendText() method or else an explicit StreamWriter.
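A minimal sketch of the StreamWriter approach follows. The names (CsvExport, WriteRows) and the string[] row representation are assumptions made for this example; in the original code the values would come from the SqlDataReader instead. The point is only that the file is opened once and every row goes through the same writer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvExport
{
    // Hypothetical helper: writes all rows through a single StreamWriter,
    // so the file is opened and closed once rather than once per row.
    public static void WriteRows(string path, IEnumerable<string[]> rows)
    {
        using (var writer = new StreamWriter(path, append: true))
        {
            foreach (string[] fields in rows)
            {
                for (int i = 0; i < fields.Length; i++)
                {
                    if (i > 0) writer.Write(',');
                    writer.Write('"');
                    // Same clean-up as the original: swap double quotes, trim.
                    writer.Write(fields[i].Replace("\"", "'").Trim());
                    writer.Write('"');
                }
                writer.WriteLine();
            }
        }
    }
}
```

StreamWriter buffers its output internally, so the per-row cost is mostly an in-memory copy; the data is flushed when the using block disposes the writer.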

The loop is only looping Once, please help find solution C#, MySQL

I am trying to execute a MySQL query inside a loop, where it gets new results every time. The problem I am facing is that the loop only succeeds on the first iteration; on the second it fails. That means if idcount is 5, it only gets through iteration 1, and when it enters iteration 2 this error appears:
Invalid attempt to access a field before calling Read()
It is on this line: "string result2 = mysqlReader5[0].ToString();"
I would appreciate any help making this loop work.
Thanks in advance.
for (int i = 0; i < idcount; i++)
{
connection.Open();
string x = idarray[i];
ImageLoop img = new ImageLoop();
image[i] =img.imageloop(x);
MySqlCommand mysqlCmd5 = new MySqlCommand("SELECT image FROM useralbum where user_id='" + x + "' LIMIT 0,1;", connection);
MySqlDataReader mysqlReader5 = mysqlCmd5.ExecuteReader();
while (mysqlReader5.Read())
{
}
string result2 = mysqlReader5[0].ToString();
image[i] = result2;
connection.Close();
}
You are accessing the reader outside the while loop, it should be inside. Like:
while (mysqlReader5.Read())
{
string result2 = mysqlReader5[0].ToString();
image[i] = result2;
}
Also the assignment to image[i] should be done inside the while loop.
You should be accessing the value inside the while block:
while (mysqlReader5.Read())
{
// this block is executed while there are records
string result2 = mysqlReader5[0].ToString();
}
Also, it would be better to use using blocks and move the for loop inside them, so that you avoid opening and closing the connection on each iteration.
Instead of using while, first check the value of mysqlReader5.Read()
if (mysqlReader5.Read())
{
string result2 = mysqlReader5[0].ToString();
}
I think you need to change your implementation. Try this link for reference:
http://csharp-station.com/Tutorial/AdoDotNet/Lesson04
datareader[0] returns the first column's value; it is not used to access an array of results (use a DataTable for that). A data reader does not return an array of results; it fetches only one row at a time, so you need to read each value inside the while (mysqlReader5.Read()) loop. I hope it helps!

Iterating through IQueryable with foreach results in an out of memory exception

I'm iterating through a smallish (~10GB) table with a foreach / IQueryable and LINQ-to-SQL.
Looks something like this:
using (var conn = new DbEntities() { CommandTimeout = 600*100})
{
var dtable = conn.DailyResults.Where(dr => dr.DailyTransactionTypeID == 1);
foreach (var dailyResult in dtable)
{
//Math here, results stored in-memory, but this table is very small.
//At the very least compared to stuff I already have in memory. :)
}
}
The Visual Studio debugger throws an out-of memory exception after a short while at the base of the foreach loop. I'm assuming that the rows of dtable are not being flushed. What to do?
The IQueryable<DailyResult> dtable will attempt to load the entire query result into memory when enumerated... before any iterations of the foreach loop. It does not load one row during the iteration of the foreach loop. If you want that behavior, use DataReader.
You call ~10GB smallish? You have a nice sense of humor!
You might consider loading rows in chunks, aka pagination.
conn.DailyResults.Where(dr => dr.DailyTransactionTypeID == 1).Skip(x).Take(y);
Using DataReader is a step backward unless there is a way to use it within LINQ. I thought we were trying to get away from ADO.
The solution suggested above works, but it's truly ugly. Here is my code:
int iTake = 40000;
int iSkip = 0;
int iLoop;
ent.CommandTimeout = 6000;
while (true)
{
iLoop = 0;
IQueryable<viewClaimsBInfo> iInfo = (from q in ent.viewClaimsBInfo
where q.WorkDate >= dtStart &&
q.WorkDate <= dtEnd
orderby q.WorkDate
select q)
.Skip(iSkip).Take(iTake);
foreach (viewClaimsBInfo qInfo in iInfo)
{
iLoop++;
if (lstClerk.Contains(qInfo.Clerk.Substring(0, 3)))
{
/// Various processing....
}
}
if (iLoop < iTake)
break;
iSkip += iTake;
}
You can see that I have to check for having run out of records because the foreach loop will end at 40,000 records. Not good.
Updated 6/10/2011: Even this does not work. At 2,000,000 records or so, I get an out-of-memory exception. It is also excruciatingly slow. When I modified it to use OleDB, it ran in about 15 seconds (as opposed to 10+ minutes) and didn't run out of memory. Does anyone have a LINQ solution that works and runs quickly?
Use .AsNoTracking() - it tells DbEntities not to cache retrieved rows
using (var conn = new DbEntities() { CommandTimeout = 600*100})
{
var dtable = conn.DailyResults
.AsNoTracking() // <<<<<<<<<<<<<<
.Where(dr => dr.DailyTransactionTypeID == 1);
foreach (var dailyResult in dtable)
{
//Math here, results stored in-memory, but this table is very small.
//At the very least compared to stuff I already have in memory. :)
}
}
I would suggest using SQL instead to modify this data.

How to write this in better way?

Let's look at this code:
IList<IHouseAnnouncement> list = new List<IHouseAnnouncement>();
var table = adapter.GetData(); //get data from repository object -> DataTable
if (table.Rows.Count >= 1)
{
for (int i = 0; i < table.Rows.Count; i++)
{
var anno = new HouseAnnouncement();
anno.Area = float.Parse(table.Rows[i][table.areaColumn].ToString());
anno.City = table.Rows[i][table.cityColumn].ToString();
list.Add(anno);
}
}
return list;
Is there a better way to write this, with less code (there must be :-) )? Maybe using a lambda (if so, please show me how)?
Thanks in advance!
Just FYI, you're never adding the new HouseAnnouncement to your list, and your loop will never execute for the last row, but I'm assuming those are errors in the example rather than in your actual code.
You could do something like this:
return adapter.GetData().Rows.Cast<DataRow>().Select(row =>
new HouseAnnouncement()
{
Area = Convert.ToSingle(row["powierzchnia"]),
City = (string)row["miasto"],
}).ToList();
I usually go for readability over brevity, but I feel like this is pretty readable.
Note that while you could still cache the DataTable and use table.powierzchniaColumn in the lambda, I eliminated that so that you didn't use a closure that wasn't really necessary (closures introduce substantial complexity to the internal implementation of the lambda, so I avoid them if possible).
If it's important to you to keep the column references as they are, then you can do it like this:
using (var table = adapter.GetData())
{
return table.Rows.Cast<DataRow>().Select(row =>
new HouseAnnouncement()
{
Area = Convert.ToSingle(row[table.powierzchniaColumn]),
City = (string)row[table.miastoColumn],
}).ToList();
}
This will add complexity to the actual IL that the compiler generates, but should do the trick for you.
You could do something like this in Linq:
var table = adapter.GetData();
var q = from row in table.Rows.Cast<DataRow>()
select new HouseAnnouncement()
{ Area = float.Parse(row[table.areaColumn].ToString()),
City = row[table.cityColumn].ToString()
};
return q.ToList();
Your "if statement" is not necessary. Your "for loop" already takes care of that case.
Also, your "for loop" will not execute when the number of your Table Rows is 1. This seems like a mistake, and not by design, but I could be wrong. If you want to fix this, just take out the "-1":
for (int i = 0; i < table.Rows.Count; i++)
Well, for one thing, you appear to have an off-by-one error:
for (int i = 0; i < table.Rows.Count - 1; i++)
{
}
If your table has three rows, this will run while i is less than 3 - 1, or 2, which means it'll run for rows 0 and 1 but not for row 2. This may not be what you intend.
Can't go much simpler than one for loop and no if statements:
var table = adapter.GetData(); //get data from repository object -> DataTable
IList<IHouseAnnouncement> list = new List<IHouseAnnouncement>(table.Rows.Count);
for (int i = 0; i < table.Rows.Count; i++)
{
var anno = new HouseAnnouncement();
anno.Area = float.Parse(table.Rows[i][table.areaColumn].ToString());
anno.City = table.Rows[i][table.cityColumn].ToString();
list.Add(anno);
}
return list;
It takes more characters than the LINQ version, but is parsed faster by a programmer's brain. :)
Readability is, to me, preferable to being succinct with your code--as long as performance is not a victim. Also, I am sure that anyone who later has to maintain the code will appreciate it as well.
Even when I am maintaining my own code, I don't want to look at it, say a couple of months later, and think "what the hell was I trying to accomplish"
I might do something like this:
var table = adapter.GetData(); //get data from repository object -> DataTable
return table.Rows.Cast<DataRow>().Take(table.Rows.Count-1).Select(row => new HouseAnnouncement() {
Area = float.Parse(row[table.powierzchniaColumn].ToString()),
City = row[table.miastoColumn].ToString()
}).ToList();
