Huge DataTable memory consumption - C#

I'm loading CSV data from files into a DataTable for processing.
The problem is that I want to process several files, and my tests with the DataTable show huge memory consumption.
I tested with a 37 MB CSV file and the memory grew to 240 MB, which is way too much IMHO.
I read that there is overhead in the DataTable; I could live with about 70 MB, but not 240 MB, which is six times the original size.
I read here that DataTables need more memory than POCOs, but that difference is way too much.
I attached a memory profiler and checked whether I have memory leaks and where the memory goes. I found that the DataColumns each hold between 6 MB and 19 MB of strings, and the DataTable has about 20 columns. Are the values stored in the columns? Why is so much memory taken, and what can I do to reduce the consumption?
With this memory consumption, DataTables seem to be unusable.
Has anybody else had such problems with DataTables, or am I doing something wrong?
PS: I tried a 70 MB file and the DataTable grew to 500 MB!
OK, here is a small test case:
The 37 MB CSV file (21 columns) makes the memory grow to 179 MB.
private static DataTable ReadCsv()
{
    DataTable table = new DataTable();
    table.BeginLoadData();
    using (var reader = new StreamReader(File.OpenRead(@"C:\Develop\Tests\csv-Data\testdaten\test.csv")))
    {
        int y = 0;
        int columnsCount = 0;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            var values = line.Split(',');
            if (y == 0)
            {
                columnsCount = values.Length;
                // create columns
                for (int x = 0; x < columnsCount; x++)
                {
                    table.Columns.Add(new DataColumn(values[x], typeof(string)));
                }
            }
            else
            {
                if (values.Length == columnsCount)
                {
                    // add the data
                    table.Rows.Add(values);
                }
            }
            y++;
        }
        table.EndLoadData();
        table.AcceptChanges();
    }
    return table;
}

DataSet and its children DataTable, DataRow, etc. make up an in-memory relational database. There is a lot of overhead involved (though it does make some things very convenient).
If memory is an issue:
Build domain objects to represent each row in your CSV file, with typed properties.
Create a custom collection (or just use an IList<T>) to hold them; see the sketch after this list.
Alternatively, build a light-weight class with the basic semantics of a DataTable:
the ability to select a row by number
the ability to select a column within a row by row number and either column name or number
the ability to know the ordered set of column names
Bonus: the ability to select a column by name or ordinal number and receive a list of its values, one per row.
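As a rough illustration of the first option, here is a minimal sketch under assumptions: CsvRecord and its properties are made up, since the real 21-column schema isn't shown, and the typed properties (int, DateTime instead of string everywhere) are where most of the memory savings would come from.

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical domain object; replace the properties with your real columns.
public sealed class CsvRecord
{
    public int Id { get; set; }             // assumed numeric column
    public DateTime Timestamp { get; set; } // assumed date column
    public string Name { get; set; }        // assumed text column
    // ... remaining columns ...
}

private static List<CsvRecord> ReadCsvAsObjects(string path)
{
    var records = new List<CsvRecord>();
    using (var reader = new StreamReader(File.OpenRead(path)))
    {
        reader.ReadLine(); // skip the header row
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var values = line.Split(',');
            records.Add(new CsvRecord
            {
                Id = int.Parse(values[0]),
                Timestamp = DateTime.Parse(values[1]),
                Name = values[2]
                // ... map the remaining columns ...
            });
        }
    }
    return records;
}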
Are you sure you need an in-memory representation of your CSV files? Could you access them via an IDataReader like Sebastien Lorion's Fast CSV Reader?

DataTables are a generic solution for putting tabular data into memory and adding lots of table-related features. If the overhead is not acceptable to you, you have the option to 1) write your own DataTable-like class that eliminates the overhead you don't need, 2) use an alternate representation that still accomplishes what you need, perhaps POCO based, or maybe an XmlDocument (which may have just as much overhead, maybe more; I've never really worried about it), or 3) stop trying to load everything into memory and just bring data in as needed from your external store (see the sketch below).
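For option 3, a minimal sketch of streaming rows instead of materializing them all (the file path and the ',' split are carried over from the question; a dedicated CSV parser such as the Fast CSV Reader mentioned in the previous answer also handles quoting and escaping):

using System.Collections.Generic;
using System.IO;

// Rows are yielded one at a time and never all held in memory at once.
private static IEnumerable<string[]> StreamCsvRows(string path)
{
    using (var reader = new StreamReader(File.OpenRead(path)))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line.Split(',');
        }
    }
}

// Usage: process each row as it arrives, e.g.
// foreach (var row in StreamCsvRows(@"C:\Develop\Tests\csv-Data\testdaten\test.csv")) { ... }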

Related

Why does matching a value of a column in a datagridview row by row consume a lot of memory?

I have a DataGridView object called dgv whose DataSource property is set to a BindingSource object, which in turn points to a DataTable object called ds.DataTable1. DataTable1 has a Name column.
First, I had it load 700,000 fully-filled rows by using tableAdapter.Fill(ds.dataTable1). Then I ran a search using a for-loop like so:
int rowCount = dgv.Rows.Count;
string searchName = "zelda";
// Search next record
for (int i = 0; i < rowCount; i++)
{
    if (dgv.Rows[i].Cells[column1.Name].Value.ToString() == searchName.ToUpper())
    {
        break;
    }
}
I realized that this search uses more memory than the Fill of the dataset itself. To illustrate: when I initialize the app, it consumes around 50 MB. When I Fill the dataset, memory consumption reaches 600 MB. When I run the search, memory usage reaches 1.5 GB before it hits an out-of-memory error, as my app is a 32-bit one.
I have found out that this line
if (dgv.Rows[i].Cells[column1.Name].Value.ToString() == searchName.ToUpper())
is the culprit.
Any idea why calling this line hundreds of thousands of times consumes such a large amount of memory? I understand that the DataGridView consumes lots of memory because it needs to hold a large amount of data, but I do not really understand why looping and searching for a match in every row of the DataGridView causes even larger memory consumption.
I am fairly new to memory management so would appreciate some references to resources that explain this.
For each row, you're accessing the full row in the DataGridView, while you only need one column from it.
You can shorten the work by building a list that contains only the column you're searching. Perhaps it will cost less memory to reduce the dgv to one column of values first, and then check those values afterwards.
This snippet can help you retrieve only the column values:
List<string> columnvalues = new List<string>();
foreach (DataGridViewRow row in dataGridView1.Rows)
{
    columnvalues.Add(row.Cells[column1.Name].Value.ToString());
}

// Then test the list instead of breaking out of the original loop:
if (columnvalues.Contains(searchName.ToUpper()))
{
    // match found
}
Source: https://stackoverflow.com/a/21314241/2735344
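A related option, not part of the linked answer, is to search the bound DataTable instead of the grid, so that no DataGridView cells are touched at all. A hedged sketch, assuming the ds.DataTable1 table and its Name column from the question (requires a reference to System.Data.DataSetExtensions for AsEnumerable):

using System;
using System.Data;
using System.Linq;

// Search the underlying data rather than the DataGridView rows.
bool found = ds.DataTable1.AsEnumerable()
    .Any(r => string.Equals(r.Field<string>("Name"), searchName,
                            StringComparison.OrdinalIgnoreCase));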

How to export more than 1 million rows from SQL Server table to CSV in C# web app?

I am trying to export a SQL Server table with 1 million rows and 45 columns to a .csv file for the user to download via the web interface but it takes so long that I eventually have to stop the process manually.
I use a SqlDataReader and write into the file as the reader reads to avoid memory problems. The code works for small tables (less than 3k rows) but the large one keeps running and the destination file stays at 0 KB.
using (spContentConn)
{
    using (var sdr = sqlcmd.ExecuteReader())
    using (CsvfileWriter)
    {
        DataTable Tablecolumns = new DataTable();
        for (int i = 0; i < sdr.FieldCount; i++)
        {
            Tablecolumns.Columns.Add(sdr.GetName(i));
        }
        CsvfileWriter.WriteLine(string.Join("~", Tablecolumns.Columns.Cast<DataColumn>().Select(csvfile => csvfile.ColumnName)));
        while (sdr.Read())
            for (int j = Tablecolumns.Columns.Count; j > 0; j--)
            {
                if (j == 1)
                    CsvfileWriter.WriteLine("");
                else
                    CsvfileWriter.Write(sdr[Tablecolumns.Columns.Count - j].ToString() + "~");
            }
    }
}
I used the same approach recommended in this thread, but it still doesn't work. Please help: export large datatable data to .csv file in c# windows applications
It is not clear from the .NET documentation whether the file writer you are using has efficient buffering, therefore I always use a BufferedStream instead when I need to read/write large volumes of data. With a stream, you would have to write byte data instead of strings, but that requires only a minor adaptation of your code.
It also looks like you are reading and writing the columns of a DataTable in a loop, which would affect performance. Since the number and order of the columns would not change during an export operation, consider using the positional index to access the column values instead. It would also be better to write one row at a time instead of one column at a time.
Finally, you are using a data-reader, so that should provide the best throughput of data from your SQL Server (limited by your server and bandwidth, obviously). This would also suggest that the performance bottleneck is in the way that your data is being written to file.
For comparison, I just wrote 1,000,000 rows of 45 columns to a text file in under 60 seconds. Granted that my code does not read from a database, but that should still provide a good enough baseline for you.
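As a hedged sketch of those suggestions combined (spContentConn, sqlcmd, and the '~' delimiter are carried over from the question; the output path is hypothetical), writing one line per row with positional column access could look like this:

// Stream rows from the reader and write one line per row.
using (spContentConn)
using (var sdr = sqlcmd.ExecuteReader())
using (var stream = new BufferedStream(File.Create(@"C:\export\output.csv"), 1 << 20))
using (var writer = new StreamWriter(stream))
{
    int fieldCount = sdr.FieldCount;

    // Header line, built once from the reader's own metadata.
    var header = new string[fieldCount];
    for (int i = 0; i < fieldCount; i++)
        header[i] = sdr.GetName(i);
    writer.WriteLine(string.Join("~", header));

    // One WriteLine per data row, columns accessed by positional index.
    var values = new object[fieldCount];
    while (sdr.Read())
    {
        sdr.GetValues(values);
        writer.WriteLine(string.Join("~", values));
    }
}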

What is the fastest way to populate a C# DataTable with data stored on columns?

I have a DataTable object that I need to fill based on data stored in a stream of columns - i.e. the stream initially contains the schema of the DataTable, and subsequently, values that should go into it organised by column.
At present, I'm taking the rather naive approach of
Create enough empty rows to hold all data values.
Fill those rows per cell.
The result is a per-cell iteration, which is not especially quick to say the least.
That is:
// Create rows first...
// Then populate...
foreach (var col in table.Columns.Cast<DataColumn>())
{
    List<object> values = GetValuesfromStream(theStream);
    // Actual method has some DBNull checking here, but should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        table.Rows[i][col] = values[i];
}
My guess is that the backing DataStorage items for each column aren't expanding as the rows are added, but only as values are added to each column, though I'm far from certain. Any tips for loading this kind of data?
NB that loading all lists first and then reading in by row is probably not sensible - this approach is being taken in the first place to mitigate potential out of memory exceptions that tend to result when serializing huge DataTable objects, so grabbing a clone of the entire data grid and reading it in would probably just move the problem elsewhere. There's definitely enough memory for the original table and another column of values, but there probably isn't for two copies of the DataTable.
Whilst I haven't found a way to avoid iterating cells, as per the comments above, I've found that writing to DataRow items that have already been added to the table turns out to be a bad idea, and was responsible for the vast majority of the slowdown I observed.
The final approach I used ended up looking something like this:
List<DataRow> rows = null;
// Start population...
var cols = table.Columns.Cast<DataColumn>().Where(c => string.IsNullOrEmpty(c.Expression));
foreach (var col in cols)
{
    List<object> values = GetValuesfromStream(theStream);
    // Create rows first if required.
    if (rows == null)
    {
        rows = new List<DataRow>();
        for (var i = 0; i < values.Count; i++)
            rows.Add(table.NewRow());
    }
    // Actual method has some DBNull checking here, but should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        rows[i][col] = values[i];
}
rows.ForEach(r => table.Rows.Add(r));
This approach addresses two problems:
If you try to add an empty DataRow to a table that has null-restrictions or similar, then you'll get an error. This approach ensures all the data is there before it's added, which should address most such issues (although I haven't had need to check how it works with auto-incrementing PK columns).
Where expressions are involved, these are evaluated when row state changes for a row that has been added to a table. Consequently, where before I had re-calculation of all expressions taking place every time a value was added to a cell (expensive and pointless), now all calculation takes place just once after all base data has been added.
There may of course be other complications with writing to a table that I've not yet encountered because the tables I am making use of don't use those features of the DataTable class/model. But for simple cases, this works well.

Managed Esent - quicker way to read data?

Using the Managed Esent interface to read data from a table, I am doing this with (pseudo-code):
List<ColumnInfo> columns; // three columns to be read
using (var table = new Table(session, DBID, "tablename", OpenTableGrbit.ReadOnly))
{
    while (Api.TryMoveNext(session, table))
    {
        foreach (ColumnInfo col in columns)
        {
            string data = GetFormattedColumnData(session, table, col);
        }
    }
}
I am interested in data from three columns only, which is around 4,000 rows. However, the table itself is 1,800,000 rows. Hence this approach is very slow to just read the data I want as I need to read all 1,800,000 rows. Is there a quicker way?
There are many things you can do. Here are a few things off the top of my head:
Set the minimum cache size with SystemParameters.CacheSizeMin. The default cache sizing algorithm is a bit conservative sometimes.
Also pass OpenTableGrbit.Sequential when opening your table. This helps a little bit with prefetching.
Use Api.RetrieveColumns to retrieve the three values at once. This reduces the number of calls/pinvokes you'll do.
-martin
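A rough sketch of how those three suggestions might fit together, assuming the three columns are strings and that the ManagedEsent ColumnValue helpers (StringColumnValue, Api.GetTableColumnid, Api.RetrieveColumns) are available in the version in use; the column names are made up:

// Sketch only: "col1"/"col2"/"col3" are placeholder column names.
SystemParameters.CacheSizeMin = 16384; // raise the minimum cache size (in database pages)

using (var table = new Table(session, DBID, "tablename",
                             OpenTableGrbit.ReadOnly | OpenTableGrbit.Sequential))
{
    var c1 = new StringColumnValue { Columnid = Api.GetTableColumnid(session, table, "col1") };
    var c2 = new StringColumnValue { Columnid = Api.GetTableColumnid(session, table, "col2") };
    var c3 = new StringColumnValue { Columnid = Api.GetTableColumnid(session, table, "col3") };

    while (Api.TryMoveNext(session, table))
    {
        // One call retrieves all three values instead of three separate pinvokes.
        Api.RetrieveColumns(session, table, c1, c2, c3);
        string data1 = c1.Value, data2 = c2.Value, data3 = c3.Value;
    }
}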

A way to avoid OutOfMemoryException while importing a large text file into a database

We're using ZyWall to guard our servers from external intrusions. It generates daily log files of huge sizes, over a GB, sometimes 2 GB. They usually contain more than 10 million lines. My task is to write an application that will import these lines into an Oracle database. I'm writing it in C#. What I'm currently doing is:
I read the logfiles line by line. I do not load the whole file at once:
string line;
using (StreamReader reader = new StreamReader(@"C:\ZyWall.log"))
{
    while ((line = reader.ReadLine()) != null)
        ......
}
For every line read, I split the line into parts at the commas:
string[] lines = line.Split(new Char[] { ',' }, 10);
Then I iterate through the lines array, create a new Row for a predefined DataTable object, and assign the array values to the columns in the row. Then I add the row to the DataTable.
After all the lines have been read into the DataTable, I use OracleBulkCopy to write its data to a physical table in the database with the same structure. The thing is, I get an OutOfMemoryException as I add the lines to the DataTable object, that is, in the 3rd step. If I comment out the 3rd step, then in the task manager I see that the application consumes a stable amount of memory, something like 17,000 K, but if I uncomment that step the memory usage grows until there is no more memory to allocate. Is there still a way I can use BulkCopy to perform this, or will I have to do it manually? I used BulkCopy because it's way faster than inserting lines one by one.
If I understand correctly, you are loading each line into a table that becomes so large that it reaches the limits of your system's memory.
If so, you should find this limit (for example, 1,000,000 lines). Stop well before this point, write the rows loaded so far with OracleBulkCopy, clean up your memory, and start again. Let me summarize everything with some pseudocode:
int lineLimit = GetConfiguration("lineLimit");
int lineNumber = 0;
string line;
DataTable logZyWall = CreateLogTable();
using (StreamReader reader = new StreamReader(@"C:\ZyWall.log"))
{
    while ((line = reader.ReadLine()) != null)
    {
        DataRow row = ParseThisLine(line);
        logZyWall.Rows.Add(row);
        lineNumber++;
        if (lineNumber == lineLimit)
        {
            // Flush the current batch and start over with a fresh table.
            WriteWithOracleBulkCopy(logZyWall);
            logZyWall = CreateLogTable();
            lineNumber = 0;
        }
    }
    // Write whatever remains in the last, partially filled batch.
    if (lineNumber != 0) WriteWithOracleBulkCopy(logZyWall);
}
