I'm trying to convert JSON to Excel. The JSON is huge, so I can't convert it directly.
I'm talking about at least 12 million entries.
I'm reading the JSON file with JsonReader and converting it part by part into DataTables.
An Excel sheet has a limit of 1,048,576 rows, so I need to create different sheets.
So I'm loading the different sheets from DataTables. The problem is that once all my DataTables are loaded, the Save() operation never ends.
A little snippet:
private void LoadDataTable(DataTable dt, ExcelPackage ep, string newName){
OfficeOpenXml.ExcelWorksheet sheet = ep.Workbook.Worksheets.Add(newName);
sheet.Cells.LoadFromDataTable(dt, true);
}
static void Main(string[] args)
{
using (ExcelPackage ep = new ExcelPackage(new FileInfo(output)))
using (StreamReader sw = new StreamReader(input))
using (JsonTextReader jr = new JsonTextReader(sw))
{
while(jr.Read()){
DataTable dt = new DataTable();
.........
//Filling DataTable with data.
.........
LoadDataTable(dt,ep,"foo"+i);
} // The total of the iterations takes 6 minutes, more or less
ep.Save();// Never ends. Here is my problem.
}
}
I think the operation sheet.Cells.LoadFromDataTable(dt, true); loads all the data into memory rather than into a file. When ep.Save() runs, it starts dumping everything from memory to a file, so it is extremely inefficient.
Is there any way to write directly into an Excel file? Or how can I make ep.Save() faster?
UPDATE:
I found this link.
I'm using .NET Core and the EPPlus version is 4.5.3.2.
IMHO, having Excel workbooks with 12 million records has to be discouraged.
How do you expect users to work with such a huge amount of data?
This is very bad design.
You should rather use a database to import and store all that stuff, and then implement SQL queries whose results can be integrated into smaller Excel files.
If you MUST use Excel in this case (holy cow, that's going to be a big file!) I strongly advise you to avoid using any of the LoadFrom*() methods built into EPPlus and write your own loops. Those methods are handy but come at a major performance cost, since they have to account for ALL conditions and not just yours. I have shaved off not seconds but minutes in exports simply by writing my own for/while loops.
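For illustration, here is a minimal sketch of that kind of hand-rolled loop, as a stand-in for LoadDataTable from the question; the method name is a placeholder, and the cell-by-cell writes skip all of the library's type probing and formatting logic:

private static void LoadDataTableManually(DataTable dt, ExcelPackage ep, string newName)
{
    OfficeOpenXml.ExcelWorksheet sheet = ep.Workbook.Worksheets.Add(newName);

    // Header row
    for (int c = 0; c < dt.Columns.Count; c++)
        sheet.Cells[1, c + 1].Value = dt.Columns[c].ColumnName;

    // Data rows, written cell by cell
    for (int r = 0; r < dt.Rows.Count; r++)
        for (int c = 0; c < dt.Columns.Count; c++)
            sheet.Cells[r + 2, c + 1].Value = dt.Rows[r][c];
}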
As far as improving SaveAs() goes, you are at the mercy of the library at that point. I have had much smaller data sets take as much as 10-15 minutes to generate the XLSX (don't ask :o). About the only way to improve that would be to generate the raw XML that is saved inside the XLSX zip file yourself, to bypass all of the library logic, because, again, it has to account for ALL possibilities. But this is no small feat - a lot has to go into mapping the cells and files in the zip properly, which is why I never put the time into figuring it out.
Assuming you've already argued with your team that Excel is not a database tool, and for some reason have been told that it's not up for discussion -
There are a couple of things you could try here:
Load the data into several separate excel files after doing some experimentation regarding how much data can be efficiently saved into a single file. This is different from using separate sheets in the sense that you can clear out memory between saves. Plus, whoever is loading this already will need some wonky reader that looks through different Excel sheets; it wouldn't be difficult to modify that to read through different files instead.
Save the data as a .csv file, and then convert it to an Excel format later (or not at all!). The limitation here is that you again cannot use Excel sheets, so you'd end up having to (getting to) take Excel out of the equation altogether, or once again save as many different Excel files.
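If you go the .csv route, a minimal sketch of the writer could look like the following; the WriteDataTableAsCsv name is hypothetical, the quoting is deliberately naive, and you would call it from the existing read loop instead of LoadDataTable:

// Requires System.Data, System.IO and System.Linq
private static void WriteDataTableAsCsv(DataTable dt, StreamWriter writer)
{
    foreach (DataRow row in dt.Rows)
    {
        // Naive escaping: quote every field and double any embedded quotes
        var fields = row.ItemArray
            .Select(v => "\"" + Convert.ToString(v).Replace("\"", "\"\"") + "\"");
        writer.WriteLine(string.Join(",", fields));
    }
}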
Related
I have to read an XML file located on a website (currently still local). I'm using C# in a Windows Forms application, and I use the following code:
try
{
DataSet dsMain = new DataSet();
dsMain.ReadXml(txtUrl.Text);
}
catch (Exception exx)
{
MessageBox.Show(exx.Message);
}
That code runs well, but the problem is that the dsMain.ReadXml() method is slow on the first connection to the website. To prove this, I surrounded it with a Stopwatch like below:
try
{
Stopwatch st = new Stopwatch();
st.Start();
DataSet dsMain = new DataSet();
dsMain.ReadXml(txtUrl.Text);
st.Stop();
MessageBox.Show(Math.Round(st.Elapsed.TotalSeconds, 2).ToString(), "XML reading cost");
}
catch (Exception exx)
{
MessageBox.Show(exx.Message);
}
The message box showed about 2-3 seconds for the first load, and about 0-0.01 seconds for every subsequent read during the application's lifetime. If I close the application and run it again, the problem occurs again. FYI, the XML file is small (under 10 KB).
So the question is: why is the DataSet.ReadXml() method slow for the first read but fast for every subsequent one? How can I speed up this method? Is there any code improvement I should add?
It is slow the first time because the runtime is dynamically generating the code to do the de-serialisation.
To avoid this, just use one of the .NET XML parsers directly into your own data structures optimised for your data (DataSet itself adds a lot of overhead by being dynamic and having a generalised interface).
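As a rough sketch of that idea, you could load the document with XDocument and project it into a small class of your own; the element and attribute names below are made up, since I don't know your feed's structure:

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class Item
{
    public string Name { get; set; }
    public string Value { get; set; }
}

// ...
XDocument doc = XDocument.Load(txtUrl.Text);
List<Item> items = doc.Descendants("item")            // placeholder element name
    .Select(x => new Item
    {
        Name = (string)x.Attribute("name"),            // placeholder attribute names
        Value = (string)x.Attribute("value")
    })
    .ToList();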
Probably because it tries to parse (or infer) the schema from the XML file. At a later stage (parsing a file for the second time) it doesn't create a schema anymore, but just adds the data to the table.
https://msdn.microsoft.com/en-us/library/360dye2a(v=vs.110).aspx
The ReadXml method provides a way to read either data only, or both data and schema into a DataSet from an XML document, whereas the ReadXmlSchema method reads only the schema. To read both data and schema, use one of the ReadXML overloads that includes the mode parameter, and set its value to ReadSchema.
...
If no in-line schema is specified, the relational structure is extended through inference, as necessary, according to the structure of the XML document. If the schema cannot be extended through inference in order to expose all data, an exception is raised.
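As a quick sketch, passing an explicit XmlReadMode should skip the inference step (assuming the document actually carries an in-line schema, or you supply one yourself):

DataSet dsMain = new DataSet();

// Read the in-line schema instead of inferring it
dsMain.ReadXml(txtUrl.Text, XmlReadMode.ReadSchema);

// Or, if you have the schema available as a separate .xsd file:
// dsMain.ReadXmlSchema("main.xsd");
// dsMain.ReadXml(txtUrl.Text, XmlReadMode.IgnoreSchema);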
I suppose the garbage collector keeps the data in memory.
What is the best approach to storing information gathered locally in .csv files in a C#/.NET SQL database? My reasons for asking are:
1: The data I have to handle is massive (millions of rows in each CSV).
2: The data is extremely precise, since it describes measurements on a nanoscopic scale, and is therefore delicate.
My first thought was to store each row of the CSV in a corresponding row in the database. I did this using the DataTable class. When done, I felt that if something went wrong when parsing the .csv file, I would never notice.
My second thought is to upload the .csv files to the database in their .csv format and later parse them from the database into the local environment when the user asks for them. If this is even possible in C#/.NET with Visual Studio 2013, how could it be done in an efficient and secure manner?
I used the .NET DataStreams library's CSV reader in my project. It uses the SqlBulkCopy class, though it is not free.
Example:
using (CsvDataReader csvData = new CsvDataReader(path, ',', Encoding.UTF8))
{
// will read in first record as a header row and
// name columns based on the values in the header row
csvData.Settings.HasHeaders = true;
csvData.Columns.Add("nvarchar");
csvData.Columns.Add("float"); // etc.
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
bulkCopy.DestinationTableName = "DestinationTable";
bulkCopy.BulkCopyTimeout = 3600;
// Optionally, you can declare columnmappings using the bulkCopy.ColumnMappings property
bulkCopy.WriteToServer(csvData);
}
}
It sounds like you are simply asking whether you should store a copy of the source CSV in the database, so if there was an import error you can check to see what happened after the fact.
In my opinion, this is probably not a great idea. It immediately makes me ask, how would you know that an error had occurred? You certainly shouldn't rely on humans noticing the mistake so you must develop a way to programmatically check for errors. If you have an automated error checking method you should apply that method when the import occurs and avoid the error in the first place. Do you see the circular logic here?
Maybe I'm missing something but I don't see the benefit of storing the CSV.
You should probably use BULK INSERT, with your .csv file as the source.
But this will only work if the file is accessible from the PC that is running your SQL Server.
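As a rough sketch, you could kick off the BULK INSERT from C# like this; the table name, file path and options are placeholders, and the path must be valid on the SQL Server machine:

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    @"BULK INSERT dbo.Measurements
      FROM 'C:\data\measurements.csv'
      WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)", conn))
{
    conn.Open();
    cmd.CommandTimeout = 0;   // large files can take a while
    cmd.ExecuteNonQuery();
}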
Here you can find a nice solution as well. In short, it looks like this:
using (StreamReader file = new StreamReader(bulk_data_filename))
using (CsvReader csv = new CsvReader(file, true, ','))
using (SqlBulkCopy copy = new SqlBulkCopy(conn))
{
    copy.DestinationTableName = tablename;
    // CsvReader implements IDataReader, so the rows stream straight to SQL Server
    copy.WriteToServer(csv);
}
I need to read multiple csv files and merge them. The merged data is used for generating a chart (with the .NET chart control).
So far I've done this with a simple StreamReader and added everything to one DataTable:
while (sr.Peek() > -1)
{
strLine = sr.ReadLine();
strLine = strLine.TrimEnd(';');
strArray = strLine.Split(delimiter);
dataTableMergedData.Rows.Add(strArray);
}
But now there is the problem that the logfiles can change. As you can see here, newer logfiles have additional columns:
My current procedure doesn't work now, and I'm asking for advice on how to do this. Performance is important, because every logfile contains about 1500 lines and up to 100 columns, and the logfiles get merged over up to a one-year period (equal to 365 files).
I would do it this way: create a DataTable which should contain all the data at the end, and read each logfile into a separate DataTable. After each read operation I would add the separate DataTable to the "big" DataTable, check whether the columns have changed, and add the new columns if they did.
But I'm afraid that using DataTables would affect the performance.
Note: I'm doing this with WinForms, but I think that doesn't matter anyway.
Edit: Tried CsvReader but this is about 4 times slower than my current solution.
After hours of testing I did it the way I described in the question:
First I created a DataTable which should contain all the data at the end. Then I go through all the logfiles in a foreach loop, and for every logfile I create another DataTable and fill it with the CSV data from the logfile. This table gets added to the first DataTable, and even if they have different columns, the rows get added properly.
This may cost some performance compared to a simple StreamReader, but it is easier to extend and still faster than the LumenWorks CsvReader.
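For reference, a minimal sketch of that merge step; ReadLogfileIntoTable and logDirectory are placeholders for your own CSV parsing and folder path:

DataTable merged = new DataTable();

foreach (string path in Directory.GetFiles(logDirectory, "*.csv"))
{
    DataTable single = ReadLogfileIntoTable(path);   // your own CSV-to-DataTable parsing

    // Merge copies the rows and adds any columns that are missing in "merged"
    merged.Merge(single, false, MissingSchemaAction.Add);
}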
I currently use a custom CSV class from Codeproject to create a CSV object. I then use this to populate a DataTable. Under profiling this is taking more time than I would like and I wonder if there is a more efficient way of doing it?
The CSV contains approximately 2,500 rows and 500 columns.
The CSV reader is from: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
StreamReader s = new StreamReader(confirmedFilePath);
CsvReader csv = new CsvReader(s, true);
DataTable dt = new DataTable();
dt.Load(csv);
I came across a Google result suggesting a DataAdapter, but it was only one reference. I searched further but didn't find any corroboration.
CsvReader is fast and reliable; I'm almost sure you can't find anything faster (if there is anything at all) for reading CSV data.
The limitation comes from the DataTable processing the new data - 2500*500 is quite an amount. I think the fastest way would be a direct CsvReader -> database (ADO.NET) chain.
Give GenericParser a try.
Always use BeginLoadData() and EndLoadData() when filling from a database, since the database already enforces the constraints by itself - the only downside is that a CSV file obviously does not, so any constraint exception is thrown only after the whole operation ends.
...
dt.BeginLoadData();
dt.Load(csv, LoadOption.Upsert);
dt.EndLoadData();
EDIT: Use LoadOption.Upsert only if the DataTable is empty, or you don't want to preserve any previous changes to existing data - it's even faster that way.
I'm building an offline C# application that will import data from spreadsheets and store it in a SQL database that I have created (inside the project). Through some research I have been able to use some code that can import a static table into a database table that has exactly the same layout as the columns in the worksheet.
What I'm looking to do is have specific columns go to their correct tables based on name. This way I have the database designed correctly and don't just have one giant table to store everything.
Below is the code I'm using to import a few static fields into one table; I want to be able to split the imported data into more than one.
What is the best way to do this?
public partial class Form1 : Form
{
string strConnection = ConfigurationManager.ConnectionStrings
["Test3.Properties.Settings.Test3ConnectionString"].ConnectionString;
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
//Create connection string to Excel work book
string excelConnectionString =
#"Provider=Microsoft.Jet.OLEDB.4.0;
Data Source=C:\Test.xls;
Extended Properties=""Excel 8.0;HDR=YES;""";
//Create Connection to Excel work book
OleDbConnection excelConnection = new OleDbConnection(excelConnectionString);
//Create OleDbCommand to fetch data from Excel
OleDbCommand cmd = new OleDbCommand
("Select [Failure_ID], [Failure_Name], [Failure_Date], [File_Name], [Report_Name], [Report_Description], [Error] from [Failures$]", excelConnection);
excelConnection.Open();
OleDbDataReader dReader;
dReader = cmd.ExecuteReader();
SqlBulkCopy sqlBulk = new SqlBulkCopy(strConnection);
sqlBulk.DestinationTableName = "Failures";
sqlBulk.WriteToServer(dReader);
}
You can try an ETL (extract-transform-load) architecture:
Extract: One class will open the file and get all the data in chunks you know how to work with (usually you take a single row from the file and parse its data into a POCO object containing fields that hold pertinent data), and put those into a Queue that other work processes can take from. In this case, maybe the first thing you do is have Excel open the file and re-save it as a CSV, so you can reopen it as basic text in your process and chop it up efficiently. You can also read the column names and build a "mapping dictionary"; this column is named that, so it goes to this property of the data object. This process should happen as fast as possible, and the only reason it should fail is because the format of a row doesn't match what you're looking for given the structure of the file.
Transform: Once the file's contents have been extracted into an instance of a basic row, perform any validation, calculations or other business rules necessary to turn a row from the file into a set of domain objects that conform to your domain model. This process can be as complex as you need it to be, but again it should be as straightforward as you can make it while obeying all the business rules given in your requirements.
Load: Now you've got an object graph in your own domain objects, you can use the same persistence framework you'd call to handle domain objects created any other way. This could be basic ADO, an ORM like NHibernate or MSEF, or an Active Record pattern where objects know how to persist themselves. It's no bulk load, but it saves you having to implement a completely different persistence model just to get file-based data into the DB.
An ETL workflow can help you separate the repetitive tasks into simple units of work, and from there you can identify the tasks that take a lot of time and consider parallel processes.
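A very small sketch of that split, just to make the shape concrete; every name in it (CsvRow, Measurement, ParseRow, ToDomain, repository, csvPath) is hypothetical and stands in for your own parsing, validation and persistence code:

// Requires System.Collections.Concurrent, System.IO and System.Threading.Tasks
var rows = new BlockingCollection<CsvRow>();

// Extract: read the file line by line and queue raw rows as fast as possible
Task extract = Task.Run(() =>
{
    foreach (string line in File.ReadLines(csvPath))
        rows.Add(ParseRow(line));                 // should fail only on a malformed row
    rows.CompleteAdding();
});

// Transform + Load: validate each row, map it to a domain object, persist it
Task load = Task.Run(() =>
{
    foreach (CsvRow row in rows.GetConsumingEnumerable())
    {
        Measurement m = ToDomain(row);            // business rules / validation
        repository.Save(m);                       // ADO.NET, ORM, Active Record, etc.
    }
});

Task.WaitAll(extract, load);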
Alternately, you can take the file and massage its format by detecting columns you want to work with, and arranging them into a format that matches your bulk input spec, before calling a bulk insert routine to process the data. This file processor routine can do anything you want it to, including separating data into several files. However, it's one big process that works on a whole file at a time and has limited opportunities for optimization or parallel processing. However, if your loading mechanism is slow, or you've got a LOT of data that is simple to digest, it may end up faster than even a well-designed ETL.
In any case, I would get away from an Office format and into a plain-text (or XML) format as soon as I possibly could, and I would DEFINITELY avoid having to install Office on a server. If there is ANY way you can require the files to be in some easily-parseable format like CSV BEFORE they're loaded, so much the better. Having an Office installation on a server is a Really Bad Thing in general, and OLE operations in a server app are not much better. The app will be very brittle, and anything Office wants to tell you will cause the app to hang until you log onto the server and clear the dialog box.
If you were looking for a more code-related answer, you could use the following to modify your code to work with different column names / different tables:
private void button1_Click(object sender, EventArgs e)
{
//Create connection string to Excel work book
string excelConnectionString =
#"Provider=Microsoft.Jet.OLEDB.4.0;
Data Source=C:\Test.xls;
Extended Properties=""Excel 8.0;HDR=YES;""";
//Create Connection to Excel work book
OleDbConnection excelConnection = new OleDbConnection(excelConnectionString);
//Create OleDbCommand to fetch data from Excel
OleDbCommand cmd = new OleDbCommand
("Select [Failure_ID], [Failure_Name], [Failure_Date], [File_Name], [Report_Name], [Report_Description], [Error] from [Failures$]", excelConnection);
excelConnection.Open();
DataTable dataTable = new DataTable();
dataTable.Columns.Add("Id", typeof(System.Int32));
dataTable.Columns.Add("Name", typeof(System.String));
// TODO: Complete other table columns
using (OleDbDataReader dReader = cmd.ExecuteReader())
{
    // Loop over every record returned by the reader, not just the first one
    while (dReader.Read())
    {
        DataRow dataRow = dataTable.NewRow();
        dataRow["Id"] = dReader.GetInt32(0);
        dataRow["Name"] = dReader.GetString(1);
        // TODO: Complete other table columns
        dataTable.Rows.Add(dataRow);
    }
}
SqlBulkCopy sqlBulk = new SqlBulkCopy(strConnection);
sqlBulk.DestinationTableName = "Failures";
sqlBulk.WriteToServer(dataTable);
}
Now you can control the names of the columns and which tables the data gets imported into. SqlBulkCopy is good for inserting large amounts of data. If you only have a small number of rows, you might be better off creating a standard data access layer to insert your records.
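A minimal sketch of that simpler path, reusing the DataTable from the snippet above; the SQL column list is a placeholder for your real schema:

using (SqlConnection conn = new SqlConnection(strConnection))
using (SqlCommand cmd = new SqlCommand(
    "INSERT INTO Failures (Failure_ID, Failure_Name) VALUES (@id, @name)", conn))
{
    conn.Open();
    foreach (DataRow row in dataTable.Rows)
    {
        cmd.Parameters.Clear();
        cmd.Parameters.AddWithValue("@id", row["Id"]);
        cmd.Parameters.AddWithValue("@name", row["Name"]);
        cmd.ExecuteNonQuery();
    }
}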
If you are only interested in the text (not the formatting, etc.), you can alternatively save the Excel file as a CSV file and parse the CSV file instead; it's simple.
Depending on the lifetime of the program, I would recommend one of two options.
If the program is to be short lived in use, or generally a "throw away" project, I would recommend a series of routines which parse and input data into another set of tables using standard SQL with some string processing as needed.
If the program will stick around longer and/or find more use on a day-to-day basis, I would recommend implementing a solution similar to the one recommended by #KeithS. With a set of well defined steps for working with the data, much flexibility is gained. More specifically, the .NET Entity Framework would probably be a great fit.
As a bonus, if you're not already well versed in this area, you might find you learn a great deal about working with data between boundaries (xls -> sql -> etc.) during your first stint with an ORM such as EF.