This is my first post on Stack Overflow.
I am reading millions of rows from a flat file (comma delimited), iterating over each row as it is read and then over each column of that row. The per-column iteration allows user-defined conversions, defaults, removal of special characters, and so on to be applied. The current implementation is already quite efficient.
The data is read in batches of 20k rows. When I'm processing a row, I call NewRow() on my in-memory DataTable and then iterate its columns to scrub their values. I'm trying to do as little work as possible while processing a row's columns.
My problem is this: if the value (text in this case) read from the flat file is longer than the MaxLength of the targeted DataTable's DataColumn, I receive an exception stating as much when I issue the following:
dataTable.Rows.Add(newRow);
Is there a way to tell ADO.Net (or my in-memory DataTable) to truncate the data instead of complaining?
Again, I can easily add logic in the loop to do this check/truncation for me, but those things add up when you're dealing with millions of rows of data.
Something like this should work:
var newRow = dataTable.NewRow();
...
...
if (YourText.Length <= ColumnMaxLength)
{
    newRow["YourLimitedColumnName"] = YourText;
}
else
{
    newRow["YourLimitedColumnName"] = YourText.Substring(0, ColumnMaxLength);
}
...
...
dataTable.Rows.Add(newRow);
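If you'd rather not hard-code the limit for each column, one option is to read it from the DataColumn itself (MaxLength is -1 when no limit is set). A minimal sketch; the extension-method name here is just an illustration:
using System.Data;

static class DataRowTruncation
{
    // Truncates the value to the target column's MaxLength (if one is set)
    // before assigning it, so Rows.Add never hits the length constraint.
    public static void SetTruncated(this DataRow row, string columnName, string value)
    {
        int max = row.Table.Columns[columnName].MaxLength;
        row[columnName] = (max > 0 && value != null && value.Length > max)
            ? value.Substring(0, max)
            : value;
    }
}
Usage is then a single call per column, e.g. newRow.SetTruncated("YourLimitedColumnName", YourText);, at the cost of one extra MaxLength lookup per assignment.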
I have a DataTable object that I need to fill based on data stored in a stream of columns - i.e. the stream initially contains the schema of the DataTable, and subsequently, values that should go into it organised by column.
At present, I'm taking the rather naive approach of:
1. Create enough empty rows to hold all the data values.
2. Fill those rows cell by cell.
The result is a per-cell iteration, which is not especially quick to say the least.
That is:
// Create rows first...
// Then populate...
foreach (var col in table.Columns.Cast<DataColumn>())   // Cast<T>() requires System.Linq
{
    List<object> values = GetValuesfromStream(theStream);
    // Actual method has some DBNull checking here, but that should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        table.Rows[i][col] = values[i];
}
My guess is that the backing DataStorage items for each column aren't expanded when the rows are added, but only as values are added to each column, though I'm far from certain. Any tips for loading this kind of data?
NB that loading all lists first and then reading in by row is probably not sensible - this approach is being taken in the first place to mitigate potential out of memory exceptions that tend to result when serializing huge DataTable objects, so grabbing a clone of the entire data grid and reading it in would probably just move the problem elsewhere. There's definitely enough memory for the original table and another column of values, but there probably isn't for two copies of the DataTable.
Whilst I haven't found a way to avoid iterating cells, as per the comments above, I've found that writing to DataRow items that have already been added to the table turns out to be a bad idea, and was responsible for the vast majority of the slowdown I observed.
The final approach I used ended up looking something like this:
List<DataRow> rows = null;
// Start population...
var cols = table.Columns.Cast<DataColumn>().Where(c => string.IsNullOrEmpty(c.Expression));
foreach (var col in cols)
{
    List<object> values = GetValuesfromStream(theStream);
    // Create rows first if required.
    if (rows == null)
    {
        rows = new List<DataRow>();
        for (var i = 0; i < values.Count; i++)
            rows.Add(table.NewRow());
    }
    // Actual method has some DBNull checking here, but that should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        rows[i][col] = values[i];
}
rows.ForEach(r => table.Rows.Add(r));
This approach addresses two problems:
1. If you try to add an empty DataRow to a table that has null restrictions or similar, you'll get an error. This approach ensures all the data is there before the row is added, which should address most such issues (although I haven't had occasion to check how it works with auto-incrementing PK columns).
2. Where expressions are involved, they are evaluated when the row state changes for a row that has been added to a table. Consequently, where before all expressions were recalculated every time a value was added to a cell (expensive and pointless), now all calculation takes place just once, after all the base data has been added.
There may of course be other complications with writing to a table that I've not yet encountered because the tables I am making use of don't use those features of the DataTable class/model. But for simple cases, this works well.
This question is about finding a more efficient way to do a simple thing. I have two DataTables with the same structure (i.e. the columns have the same names and the same ordinals). Let's call them DataTable A and DataTable B. Assume both have 100 rows. Now I want to copy all the rows of DataTable B into DataTable A without removing rows from DataTable A, so in the end DataTable A has 200 rows. I did it as shown below.
for (int i = 0; i < B.Rows.Count; i++)
{
    DataRow dr = B.Rows[i];
    A.ImportRow(dr);   // Rows.Add(dr) would throw, because dr already belongs to table B
}
The issue is that I do not want to loop. Is there a direct way to copy the whole 100 rows at once, without looping? Is there a function that lets you specify the set of rows you want to copy?
As far as I know, there is no other way of copying multiple rows from one DataTable to another than iterating through the rows. In fact, there is an article on MSDN that tells you how to copy rows between DataTables, and it uses an iteration loop.
https://support.microsoft.com/en-gb/kb/305346
There are some problems with your simple approach, because it doesn't handle primary key violations. Try BeginLoadData, LoadDataRow and EndLoadData; this should be more efficient. Call BeginLoadData and EndLoadData only once.
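Roughly, assuming A and B are the two tables from the question, the pattern looks like this (a sketch, not tested against your schema):
A.BeginLoadData();            // suspend notifications, index maintenance and constraints
foreach (DataRow dr in B.Rows)
{
    // LoadDataRow copies the values into A; 'true' accepts the changes immediately.
    // If A has a primary key, matching rows are updated instead of duplicated.
    A.LoadDataRow(dr.ItemArray, true);
}
A.EndLoadData();              // re-enable and validate everything once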
If you just need a new independent DataTable instance to work with and do not need to append rows to an existing DataTable, then the DataView.ToTable() method is very convenient.
https://msdn.microsoft.com/en-us/library/a8ycds2f(v=vs.110).aspx
It creates a separate copy with the same schema and content.
DataTable objTableB = objTableA.DefaultView.ToTable();
I'm loading CSV data from files into a DataTable for processing.
The problem is that I want to process several files, and my tests with the DataTable show huge memory consumption.
I tested with a 37 MB CSV file, and the memory grew to 240 MB, which is way too much IMHO.
I read that there is overhead in the DataTable, and I could live with about 70 MB, but not 240 MB, which is more than six times the original size.
I read here that DataTables need more memory than POCOs, but this difference is way too much.
I ran a memory profiler to check for memory leaks and to see where the memory goes. I found that the DataTable's columns each held between 6 MB and 19 MB of strings, and the DataTable had about 20 columns. Are the values stored in the columns? Why is so much memory taken, and what can I do to reduce the memory consumption?
With this memory consumption, DataTables seem to be unusable.
Has anybody else had such problems with DataTables, or am I doing something wrong?
PS: I tried a 70 MB file and the DataTable grew to 500 MB!
OK, here is a small test case:
The 37 MB CSV file (21 columns) makes the memory grow to 179 MB.
private static DataTable ReadCsv()
{
    DataTable table = new DataTable();
    table.BeginLoadData();
    using (var reader = new StreamReader(File.OpenRead(@"C:\Develop\Tests\csv-Data\testdaten\test.csv")))
    {
        int y = 0;
        int columnsCount = 0;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            var values = line.Split(',');
            if (y == 0)
            {
                columnsCount = values.Length;
                // create columns
                for (int x = 0; x < columnsCount; x++)
                {
                    table.Columns.Add(new DataColumn(values[x], typeof(string)));
                }
            }
            else
            {
                if (values.Length == columnsCount)
                {
                    // add the data
                    table.Rows.Add(values);
                }
            }
            y++;
        }
        table.EndLoadData();
        table.AcceptChanges();
    }
    return table;
}
DataSet and its children DataTable, DataRow, etc. make up an in-memory relational database. There is a lot of overhead involved (though it does make [some] things very convenient).
If memory is an issue:
- Build domain objects with typed properties to represent each row in your CSV file (see the sketch after this list).
- Create a custom collection (or just use an IList<T>) to hold them.
- Alternatively, build a lightweight class with the basic semantics of a DataTable:
  - the ability to select a row by number;
  - the ability to select a column within a row by row number and either column name or number;
  - the ability to know the ordered set of column names;
  - bonus: the ability to select a column by name or ordinal number and receive a list of its values, one per row.
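As a rough sketch of the domain-object idea from the first bullet (the class name, property names, column order and file layout here are all invented for illustration):
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public sealed class CsvRecord
{
    public string Name { get; set; }
    public string City { get; set; }
    public decimal Amount { get; set; }
}

public static class CsvLoader
{
    // Reads the file line by line and materializes plain objects instead of DataRows.
    public static List<CsvRecord> Load(string path)
    {
        var records = new List<CsvRecord>();
        foreach (var line in File.ReadLines(path).Skip(1))   // skip the header row
        {
            var values = line.Split(',');
            records.Add(new CsvRecord
            {
                Name   = values[0],
                City   = values[1],
                Amount = decimal.Parse(values[2], CultureInfo.InvariantCulture)
            });
        }
        return records;
    }
}
Each value is then stored once as a plain field, with none of the per-row change tracking and versioning a DataTable keeps, which is typically where much of its overhead comes from.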
Are you sure you need an in-memory representation of your CSV files? Could you access them via an IDataReader like Sebastien Lorion's Fast CSV Reader?
DataTables are a generic solution for putting tabular data into memory while adding lots of table-related features. If the overhead is not acceptable to you, your options are to 1) write your own DataTable-like class that eliminates the overhead you don't need, 2) use an alternate representation that still accomplishes what you need, perhaps POCO-based, or maybe an XmlDocument (which may have just as much overhead, maybe more; I've never really measured it), or 3) stop trying to load everything into memory and just bring data in as needed from your external store.
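Option 3 can be as simple as streaming the rows and handling them one at a time; a minimal sketch (reusing the file path from the question's test code):
using System.Collections.Generic;
using System.IO;

public static class CsvStreaming
{
    // Yields one parsed row at a time, so only the current row's strings stay alive.
    public static IEnumerable<string[]> StreamRows(string path)
    {
        foreach (string line in File.ReadLines(path))
        {
            yield return line.Split(',');
        }
    }
}

// Usage:
// foreach (string[] row in CsvStreaming.StreamRows(@"C:\Develop\Tests\csv-Data\testdaten\test.csv"))
// {
//     // process the row here, then let it go out of scope
// }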
I have an object structure that mimics the properties of an Excel table. So I have a table object containing properties such as a title, a header row object and body row objects. Within the header row and each body row object, I have cell objects containing info on each cell in that row. I am looking for a more efficient way to store this table structure, since in one of my uses for this object I am printing its structure to screen. Currently, I am doing an O(n^2) loop to print each cell of each row:
foreach (var row in Table.Rows)
{
    foreach (var cell in row.Cells)
    {
        Console.WriteLine(cell.ToString());
    }
}
Is there a more efficient way of storing this structure to avoid the n^2? I ask because this printing functionality sits inside another n^2 loop. Basically, I have a list of table titles and a list of tables. I need to find the tables whose titles are in the title list, and then, for each of those tables, print their rows and the cells in each row. Can any part of this operation be optimized by using a different data structure for storage, perhaps? I'm not sure exactly how they work, but I have heard of hashing and dictionaries.
Thanks
Since you are looking for tables with specific titles, you could use a dictionary to store the tables by title
Dictionary<string,Table> tablesByTitle = new Dictionary<string,Table>();
tablesByTitle.Add(table.Title, table);
...
table = tablesByTitle["SomeTableTitle"];
This would make finding a table an O(1) operation. Finding n tables would be an O(n) operation.
Printing the tables then of course depends on the number of rows and columns. There is nothing which can change that.
UPDATE:
string tablesFromGuiElement = "Employees;Companies;Addresses";
string[] selectedTables = tablesFromGuiElement.Split(';');
foreach (string title in selectedTables)
{
    Table tbl = tablesByTitle[title];
    PrintTable(tbl);
}
There isn't anything more efficient than an N^2 operation for outputting an NxN matrix of values. Worst-case, you will always be doing this.
Now, if instead of storing the values in a multidimensional collection that defines the graphical relationship of rows and columns, you put them in a one-dimensional collection and included the row-column information with each cell, then you would only need to iterate through the cells that had values. Worst-case is still N^2 for a table of N rows and N columns that is fully populated (the one-dimensional array, though linear to enumerate, will have N^2 items), but the best case would be that only one cell in that table is populated (or none are) which would be constant-time.
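A sketch of that one-dimensional idea, using a dictionary keyed by (row, column) so the enumeration cost tracks the number of populated cells rather than rows times columns (assumes C# 7+ value tuples; the values here are placeholders):
using System;
using System.Collections.Generic;

var cells = new Dictionary<(int Row, int Col), string>();

// Only populated cells are ever stored.
cells[(0, 2)] = "header";
cells[(5, 0)] = "value";

foreach (var cell in cells)
{
    Console.WriteLine($"[{cell.Key.Row},{cell.Key.Col}] = {cell.Value}");
}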
This answer applies to the printing-the-table part; the question was extended later.
For the getting-the-table part, see the other answer.
No, there is not.
Unless perhaps your values follow some predictable distribution, in which case you could use a function of x and y and store no data at all, or maybe a seed and a function.
You could cache the print output in a string or StringBuilder if you require it multiple times.
If there is enough data, I guess you might apply some compression algorithm, but I wouldn't say that was simpler or more efficient.
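A small sketch of the caching idea, using the Table, Rows and Cells objects from the question: build the output once, then reuse the string for subsequent prints.
var sb = new System.Text.StringBuilder();
foreach (var row in Table.Rows)
{
    foreach (var cell in row.Cells)
    {
        sb.AppendLine(cell.ToString());
    }
}
string cachedOutput = sb.ToString();

Console.Write(cachedOutput);   // first print
Console.Write(cachedOutput);   // later prints skip the row/cell walk entirely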
C# with .NET 2.0 and a SQL Server 2005 DB backend.
I've got a bunch of XML files which contain data along the lines of the following; the structure varies a little, but is more or less as follows:
<TankAdvisory>
  <WarningType name="Tank Overflow">
    <ValidIn>All current tanks</ValidIn>
    <Warning>Tank is close to capacity</Warning>
    <IssueTime Issue-time="2011-02-11T10:00:00" />
    <ValidFrom ValidFrom-time="2011-01-11T13:00:00" />
    <ValidTo ValidTo-time="2011-01-11T14:00:00" />
  </WarningType>
</TankAdvisory>
I have a single DB table that has all the above fields ready to be filled.
When I use the following method of reading the data from the XML file:
DataSet reportData = new DataSet();
reportData.ReadXml("../File.xml");
It successfully populates the DataSet, but with multiple tables. So when I come to use SqlBulkCopy, I can either save just one table this way:
sbc.WriteToServer(reportData.Tables[0]);
Or, if I loop through all the tables in the DataSet and add each of them, a new row is added in the database for each table, when in actuality they are all meant to be stored in the one row.
Then of course there's also the issue of column mappings; I'm thinking that maybe SqlBulkCopy is the wrong way of doing this.
What I need to do is find a quick way of getting the data from that XML file into the Database under the relevant columns in the DB.
OK, so the original question is a little old, but I have just come across a way to resolve this issue.
All you need to do is loop through all the DataTables in your DataSet and add their columns and values to the one DataTable that has all the columns of the table in your DB, like so...
DataTable dataTable = reportData.Tables[0];

// Second DataTable
DataTable dtSecond = reportData.Tables[1];

foreach (DataColumn myCol in dtSecond.Columns)
{
    sbc.ColumnMappings.Add(myCol.ColumnName, myCol.ColumnName);
    dataTable.Columns.Add(myCol.ColumnName);
    dataTable.Rows[0][myCol.ColumnName] = dtSecond.Rows[0][myCol];
}

// Finally perform the BulkCopy
sbc.WriteToServer(dataTable);

If the second table can contain more than one row, use this version of the loop instead (one loop or the other, not both, since each one adds the columns):

foreach (DataColumn myCol in dtSecond.Columns)
{
    dataTable.Columns.Add(myCol.ColumnName);
    for (int intRowcnt = 0; intRowcnt <= dtSecond.Rows.Count - 1; intRowcnt++)
    {
        dataTable.Rows[intRowcnt][myCol.ColumnName] = dtSecond.Rows[intRowcnt][myCol];
    }
}
SqlBulkCopy is for many inserts. It's perfect for those cases where you would otherwise generate a lot of INSERT statements and juggle the limit on the total number of parameters per batch. The thing about the SqlBulkCopy class, though, is that it's cranky: unless you fully specify all column mappings for the data set, it will throw an exception.
I'm assuming that your data is quite manageable, since you're reading it into a DataSet. If you were to have even larger data sets, you could lift chunks into memory and then flush them to the database piece by piece. But if everything fits in one go, it's as simple as that.
SqlBulkCopy is the fastest way to put data into the database. Just set up column mappings for all the columns; otherwise it won't work.
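Roughly, once the data has been flattened into a single DataTable as shown above, the mapping setup looks like this (a sketch; the destination table name and connection string are placeholders):
using System.Data;
using System.Data.SqlClient;

using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (SqlBulkCopy sbc = new SqlBulkCopy(connection))
    {
        sbc.DestinationTableName = "dbo.TankAdvisory";   // placeholder destination table

        // Map every source column to the destination column of the same name.
        foreach (DataColumn col in dataTable.Columns)
        {
            sbc.ColumnMappings.Add(col.ColumnName, col.ColumnName);
        }

        sbc.WriteToServer(dataTable);
    }
}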
Why reinvent the wheel? Use SSIS. Read with an XML Source, transform with one of the many available Transformations, then load it with an OLE DB Destination into the SQL Server table. You will never beat SSIS in terms of runtime, speed to deploy the solution, maintenance, error handling, etc.