Ways to increase performance of DataTable.Load()? - c#

I currently use a custom CSV class from CodeProject to create a CSV object, which I then use to populate a DataTable. Profiling shows this takes more time than I would like, and I wonder if there is a more efficient way of doing it.
The CSV contains approximately 2,500 rows and 500 columns.
The CSV reader is from: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
StreamReader s = new StreamReader(confirmedFilePath);
CsvReader csv = new CsvReader(s, true);
DataTable dt = new DataTable();
dt.Load(csv);
A Google search turned up a suggestion to use a DataAdapter, but it was only a single reference; I searched further but didn't find any corroboration.

CsvReader is fast and reliable; I'm almost sure you can't find anything faster (if there is anything faster at all) for reading CSV data.
The limitation comes from the DataTable processing the new data - 2,500 * 500 is quite a large amount. I think the fastest way would be a direct CsvReader -> database (ADO.NET) chain, skipping the DataTable entirely.
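As a rough sketch of that chain (the LumenWorks CsvReader implements IDataReader, so it can be streamed straight into SqlBulkCopy; the connection string and destination table name here are assumptions):

using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv;
...
using (StreamReader s = new StreamReader(confirmedFilePath))
using (CsvReader csv = new CsvReader(s, true))
using (SqlConnection connection = new SqlConnection(connectionString)) // assumed
{
    connection.Open();
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.CsvImport"; // hypothetical table
        bulkCopy.WriteToServer(csv); // streams rows from the reader, no DataTable involved
    }
}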

Give GenericParser a try.

Always use BeginLoadData() and EndLoadData() when filling from databases, since the database already enforces the constraints by itself - the only downside is that a CSV file obviously does not, so any constraint exception is thrown only after the whole operation ends.
...
dt.BeginLoadData();
dt.Load(csv, LoadOption.Upsert);
dt.EndLoadData();
EDIT: Use LoadOption.Upsert only if the DataTable is empty or you don't want to preserve any previous changes to existing data - it's even faster that way.
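Put together with the CsvReader from the question, a fuller sketch might look like this (same types and file path as above):

using (StreamReader s = new StreamReader(confirmedFilePath))
using (CsvReader csv = new CsvReader(s, true))
{
    DataTable dt = new DataTable();
    dt.BeginLoadData();               // suspends constraint checks and change notifications
    dt.Load(csv, LoadOption.Upsert);  // see the note above about when Upsert is appropriate
    dt.EndLoadData();                 // constraints are re-checked here, in one pass
}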

Related

Functions Save() and SaveAs() are too slow

I'm trying to convert from JSON to Excel. The JSON is huge, so I can't convert it directly.
I'm talking about at least 12 million entries.
I'm reading the JSON file with JsonReader and converting it part by part into DataTables.
An Excel sheet has a limit of 1,048,576 rows, so I need to create different sheets.
So I'm loading the different sheets from the DataTables. The problem is that when all my DataTables are loaded, the Save() operation never ends.
A little snippet:
private void LoadDataTable(DataTable dt, ExcelPackage ep, string newName)
{
    OfficeOpenXml.ExcelWorksheet sheet = ep.Workbook.Worksheets.Add(newName);
    sheet.Cells.LoadFromDataTable(dt, true);
}
static void Main(string[] args)
{
    using (ExcelPackage ep = new ExcelPackage(new FileInfo(output)))
    using (StreamReader sw = new StreamReader(input))
    using (JsonTextReader jr = new JsonTextReader(sw))
    {
        while (jr.Read())
        {
            DataTable dt = new DataTable();
            .........
            //Filling DataTable with data.
            .........
            LoadDataTable(dt, ep, "foo" + i);
        } //The total of the iterations takes 6 minutes more or less

        ep.Save(); // Never ends. Here is my problem.
    }
}
I think the operation sheet.Cells.LoadFromDataTable(dt, true); loads all the data into memory rather than into a file. When ep.Save() runs, it starts dumping everything from memory to the file, so it is extremely inefficient.
Is there any way to write directly into an Excel file? Or how can I make ep.Save() faster?
UPDATE:
I found this link.
I'm using .NET Core and the EPPlus version is v4.5.3.2.
IMHO, having Excel workbooks with 12 million records has to be discouraged.
How do you think users can work with such a huge amount of data?
This is very bad design.
You should rather use a database to import and store all that stuff, and then implement SQL queries whose results can be integrated into smaller Excel files.
If you MUST use Excel in this case (holy cow, that's going to be a big file!) I strongly advise you to avoid any of the LoadFrom*() methods built into EPPlus and to write your own loops (see the sketch below). Those methods are handy but come at a major performance cost, since they have to account for ALL conditions and not just yours. I have shaved off not seconds but minutes in exports simply by writing my own for/while loops.
As far as improving SaveAs() goes, you are at the mercy of the library at that point. I have had much smaller data sets take as much as 10-15 minutes to generate the XLSX (don't ask :o). About the only way to improve that would be to generate the raw XML that is saved in the XLSX zip file itself, to bypass all of the library logic, because, again, it has to account for ALL possibilities. But this is no small feat - a lot has to go into mapping the cells and files in the zip properly, which is why I never put the time into figuring it out.
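As an illustration of the hand-written loop approach, a rough sketch (assuming EPPlus 4.x, an open ExcelPackage ep, and a DataTable dt as in the question; the header handling is my own guess):

OfficeOpenXml.ExcelWorksheet sheet = ep.Workbook.Worksheets.Add(newName);

// header row
for (int c = 0; c < dt.Columns.Count; c++)
    sheet.Cells[1, c + 1].Value = dt.Columns[c].ColumnName;

// data rows (EPPlus cell indexes are 1-based)
for (int r = 0; r < dt.Rows.Count; r++)
{
    DataRow row = dt.Rows[r];
    for (int c = 0; c < dt.Columns.Count; c++)
        sheet.Cells[r + 2, c + 1].Value = row[c];
}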
Assuming you've already argued with your team that Excel is not a database tool, and for some reason have been told that it's not up for discussion -
There are a couple of things you could try here:
Load the data into several separate Excel files after doing some experimentation regarding how much data can be efficiently saved into a single file. This is different from using separate sheets in the sense that you can clear out memory between saves. Plus, whoever is loading this will already need some wonky reader that looks through different Excel sheets; it wouldn't be difficult to modify that to read through different files instead.
Save the data as a .csv file, and then convert it to an Excel format later (or not at all!). The limitation here is that you again cannot use Excel sheets, so you'd end up having to (getting to) take Excel out of the equation altogether, or once again save many different Excel files.
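For the .csv route, a minimal sketch (no quoting/escaping, which real data may need; the output path is hypothetical and dt is one of the DataTables from the question):

using System.Data;
using System.IO;
using System.Linq;
...
using (StreamWriter writer = new StreamWriter(@"C:\exports\part1.csv"))
{
    // header
    writer.WriteLine(string.Join(",", dt.Columns.Cast<DataColumn>()
                                        .Select(c => c.ColumnName)));
    // rows
    foreach (DataRow row in dt.Rows)
        writer.WriteLine(string.Join(",", row.ItemArray));
}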

Csv-files to Sql-database c#

What is the best approach to storing information gathered locally in .csv files in a SQL database with C#/.NET? My reasons for asking are:
1: The data I have to handle is massive (millions of rows in each CSV). 2: The data is extremely precise, since it describes measurements on a nanoscopic scale, and is therefore delicate.
My first thought was to store each row of the CSV in a corresponding row in the database. I did this using the DataTable class. When done, I felt that if something went wrong when parsing the .csv file, I would never notice.
My second thought is to upload the .csv files to the database in their .csv format and later parse each file from the database into the local environment when the user asks for it. If this is even possible in C#/.NET with Visual Studio 2013, how could it be done in an efficient and secure manner?
I used the CSV reader from the .NET DataStreams library in my project. It uses the SqlBulkCopy class, though it is not free.
Example:
using (CsvDataReader csvData = new CsvDataReader(path, ',', Encoding.UTF8))
{
    // will read in first record as a header row and
    // name columns based on the values in the header row
    csvData.Settings.HasHeaders = true;

    csvData.Columns.Add("nvarchar");
    csvData.Columns.Add("float"); // etc.

    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "DestinationTable";
        bulkCopy.BulkCopyTimeout = 3600;

        // Optionally, you can declare column mappings using the bulkCopy.ColumnMappings property
        bulkCopy.WriteToServer(csvData);
    }
}
It sounds like you are simply asking whether you should store a copy of the source CSV in the database, so if there was an import error you can check to see what happened after the fact.
In my opinion, this is probably not a great idea. It immediately makes me ask, how would you know that an error had occurred? You certainly shouldn't rely on humans noticing the mistake so you must develop a way to programmatically check for errors. If you have an automated error checking method you should apply that method when the import occurs and avoid the error in the first place. Do you see the circular logic here?
Maybe I'm missing something but I don't see the benefit of storing the CSV.
You should probably use BULK INSERT, with your .csv file as the source.
But this will only work if the file is accessible from the machine that is running your SQL Server instance.
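A hedged sketch of issuing BULK INSERT from C# (the table name and file path are hypothetical, and the path must be visible to the SQL Server machine itself):

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = conn.CreateCommand())
{
    cmd.CommandText = @"
        BULK INSERT dbo.Measurements
        FROM 'C:\imports\measurements.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)";
    cmd.CommandTimeout = 3600; // large files can take a while
    conn.Open();
    cmd.ExecuteNonQuery();
}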
Here you can find a nice solution as well. In short, it looks like this:
using (StreamReader file = new StreamReader(bulk_data_filename))
using (CsvReader csv = new CsvReader(file, true, ','))
using (SqlBulkCopy copy = new SqlBulkCopy(conn))
{
    copy.DestinationTableName = tablename;
    copy.WriteToServer(csv);
}

Read csv logfiles with different headers/columns

I need to read multiple csv files and merge them. The merged data is used for generating a chart (with the .NET chart control).
So far I've done this with a simple StreamReader and added everything to one DataTable:
while (sr.Peek() > -1)
{
    strLine = sr.ReadLine();
    strLine = strLine.TrimEnd(';');
    strArray = strLine.Split(delimiter);
    dataTableMergedData.Rows.Add(strArray);
}
But now there is the problem that the logfiles can change: newer logfiles have additional columns.
My current procedure doesn't work any more, and I'm asking for advice on how to handle this. Performance is important, because every logfile contains about 1,500 lines and up to 100 columns, and the logfiles get merged over periods of up to one year (i.e. up to 365 files).
I would do it this way: create a DataTable which should contain all the data at the end, and read each logfile into a separate DataTable. After each read operation I would add the separate DataTable to the "big" DataTable, check whether the columns have changed, and add the new columns if they did.
But I'm afraid that using DataTables would affect the performance.
Note: I'm doing this with WinForms, but I think that doesn't matter anyway.
Edit: Tried CsvReader but this is about 4 times slower than my current solution.
After hours of testing I did it the way I described in the question:
First I created a DataTable which should contain all the data at the end. Then I go through all the logfiles in a foreach loop, and for every logfile I create another DataTable and fill it with the CSV data from that logfile. This table gets merged into the first DataTable, and even if the tables have different columns, everything gets added properly.
This may cost some performance compared to a simple StreamReader, but it is much easier to extend and still faster than the LumenWorks CsvReader.
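Roughly what that looks like, assuming each logfile is read into its own DataTable by a hypothetical ReadLogfile helper and logFiles is the list of file paths:

DataTable mergedData = new DataTable();

foreach (string file in logFiles)
{
    DataTable logTable = ReadLogfile(file); // hypothetical per-file CSV reader

    // Merge copes with differing schemas: columns that are new to
    // mergedData are added automatically.
    mergedData.Merge(logTable, false, MissingSchemaAction.Add);
}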

What is the fastest way to load a big data set into a GridView?

I have a data source with 1.4+ million rows in it, and growing.
We make the users add filters to cut the returned data down, but you are still looking at roughly 43,000 to 100,000 rows at a time.
Before anyone says that no one can look at that many rows anyway: they are exported to an Excel workbook for calculations based on them.
I am loading the result into the GridView as follows, from the CSV file that is returned:
Object result = URIService.data;
CSVReader csvReader = new CSVReader(result);
DataTable dataTable = csvReader.CreateDataTable(true, true);

if (dataTable != null)
{
    gridView1.BeginUpdate();
    gridView1.DataSource = dataTable;
    gridView1.DataBind();
    gridView1.EndUpdate();
}
else
{
    return;
}
CSVReader is a CSV Parser.
My question is, is this the best and most efficient way to load a large data set to a gridview?
EDIT: Would using a list for the rows or something other than a data table be better?
I think there is only one way to load such a large data set into a grid view, and it is the one you are using right now. But if you want to improve performance, I highly recommend pagination, so that only a chunk of data is loaded on each page; this will decrease the loading time (see the paging sketch after the links below).
http://sivanandareddyg.blogspot.com/2011/11/efficient-server-side-paging-with.html
http://www.codeproject.com/Articles/125541/Effective-Paging-with-GridView-Control-in-ASP-NET
https://web.archive.org/web/20211020140032/https://www.4guysfromrolla.com/articles/031506-1.aspx
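A minimal paging sketch, assuming the full DataTable from the CSV is already in memory (AsEnumerable/CopyToDataTable need a reference to System.Data.DataSetExtensions; pageIndex, pageSize and currentPageIndex are assumptions):

DataTable GetPage(DataTable source, int pageIndex, int pageSize)
{
    // note: CopyToDataTable throws on an empty sequence, so guard the last page in real code
    return source.AsEnumerable()
                 .Skip(pageIndex * pageSize)
                 .Take(pageSize)
                 .CopyToDataTable();
}

// bind only the current page instead of the whole table:
gridView1.BeginUpdate();
gridView1.DataSource = GetPage(dataTable, currentPageIndex, 500);
gridView1.DataBind();
gridView1.EndUpdate();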
Did you try to use buffered renderer?
In the case of SQL Server, use the SqlBulkCopy class to copy large amounts of data at the highest speed.

C# console app. SQLBulkCopy and quickly entering XML into a SQL Server DB

C# with .NET 2.0, with a SQL Server 2005 DB backend.
I have a bunch of XML files which contain data along the lines of the following; the structure varies a little but is more or less as follows:
<TankAdvisory>
  <WarningType name="Tank Overflow">
    <ValidIn>All current tanks</ValidIn>
    <Warning>Tank is close to capacity</Warning>
    <IssueTime Issue-time="2011-02-11T10:00:00" />
    <ValidFrom ValidFrom-time="2011-01-11T13:00:00" />
    <ValidTo ValidTo-time="2011-01-11T14:00:00" />
  </WarningType>
</TankAdvisory>
I have a single DB table that has all the above fields ready to be filled.
When I use the following method of reading the data from the XML file:
DataSet reportData = new DataSet();
reportData.ReadXml("../File.xml");
It successfully populates the DataSet, but with multiple tables. So when I come to use SqlBulkCopy I can either save just one table this way:
sbc.WriteToServer(reportData.Tables[0]);
Or, if I loop through all the tables in the DataSet and add them one by one, each one ends up as a new row in the database, when in actuality they are all meant to be stored in a single row.
Then of course there's also the issue of column mappings; I'm thinking that maybe SqlBulkCopy is the wrong way of doing this.
What I need to do is find a quick way of getting the data from that XML file into the Database under the relevant columns in the DB.
OK, so the original question is a little old, but I have just come across a way to resolve this issue.
All you need to do is loop through all the DataTables in your DataSet and add their columns to the one DataTable that has all the columns of the table in your DB, like so...
DataTable dataTable = reportData.Tables[0];

//Second DataTable
DataTable dtSecond = reportData.Tables[1];

foreach (DataColumn myCol in dtSecond.Columns)
{
    sbc.ColumnMappings.Add(myCol.ColumnName, myCol.ColumnName);
    dataTable.Columns.Add(myCol.ColumnName);
    dataTable.Rows[0][myCol.ColumnName] = dtSecond.Rows[0][myCol];
}

//Finally Perform the BulkCopy
sbc.WriteToServer(dataTable);
If the additional tables contain more than one row, copy every row across instead of just the first one:

foreach (DataColumn myCol in dtSecond.Columns)
{
    dataTable.Columns.Add(myCol.ColumnName);

    for (int intRowcnt = 0; intRowcnt <= dtSecond.Rows.Count - 1; intRowcnt++)
    {
        dataTable.Rows[intRowcnt][myCol.ColumnName] = dtSecond.Rows[intRowcnt][myCol];
    }
}
SqlBulkCopy is for many inserts. It's perfect for those cases where you would otherwise generate a lot of INSERT statements and juggle the limit on the total number of parameters per batch. The thing about the SqlBulkCopy class, though, is that it's cranky: unless you fully specify all column mappings for the data set, it will throw an exception.
I'm assuming that your data is quite manageable, since you're reading it into a DataSet. If you were to have even larger data sets you could lift chunks into memory and then flush them to the database piece by piece. But if everything fits in one go, it's as simple as that.
SqlBulkCopy is the fastest way to put data into the database. Just set up column mappings for all the columns, otherwise it won't work.
Why reinvent the wheel? Use SSIS. Read with an XML Source, transform with one of the many transformations, then load it with an OLE DB Destination into the SQL Server table. You will never beat SSIS in terms of runtime, speed to deploy the solution, maintenance, error handling, etc.
