I'm trying to read an Excel file with C#, and the document is 181 MB. I have tried using Microsoft.Office.Interop.Excel, OpenXML, ClosedXML, and ExcelDataReader.
I wasn't able to get OpenXML to work, and ClosedXML seems to have issues with large Excel files (it also takes at least 6 minutes to read the file). I like ExcelDataReader the most since I can read the data table like an array, but it still takes 4-5 minutes to read the file, which is much faster than Interop but still a long wait. I'm considering converting the Excel document into a CSV file, but when I did that the size went from 181 MB to 248 MB, so I'm not sure it would be more efficient. It also forces the users to do an extra step to convert their files to CSV, but if the performance gain is worth it I will attempt this route.
Unfortunately, I am not able to determine in advance how many columns and rows the Excel document will have, as the users will be using an OpenFileDialog to select a file.
Is ExcelDataReader the best way to go or is there a better solution?
Here's my current code in case there are improvements I can make:
OpenFileDialog openFileDialog = new OpenFileDialog();
openFileDialog.Filter = "Excel Files|*.xls;*.xlsx;*.xlsm";
if (openFileDialog.ShowDialog() == true)
{
    using (var stream = File.Open(openFileDialog.FileName, FileMode.Open, FileAccess.Read))
    {
        using (var reader = ExcelReaderFactory.CreateReader(stream))
        {
            // results will be in dataSet.Tables
            var dataSet = reader.AsDataSet();
            var dataTable = dataSet.Tables[0];
            int r = 0;
            for (int c = 0; c < dataTable.Columns.Count; c += 3)
            {
                TagListData.Add(new TagClass { IsTagSelected = false, TagName = dataTable.Rows[r][c].ToString(), rIndex = r, cIndex = c });
            }
        }
    }
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
}
Idea 1: There is some overhead with ExcelDataReader's AsDataSet, so it's a good idea to use the reader interface directly when working with large sheets. It implements the IDataReader interface and provides per-row access to the data:
using (var reader = ExcelReaderFactory.CreateReader(stream)) {
    int r = 0;        // only the first row is needed, as in the original code
    reader.Read();    // advance to the first row
    for (int c = 0; c < reader.FieldCount; c += 3) {
        TagListData.Add(new TagClass { IsTagSelected = false, TagName = Convert.ToString(reader.GetValue(c)), rIndex = r, cIndex = c });
    }
}
Idea 2: Try passing an ExcelDataSetConfiguration with UseColumnDataType = false to AsDataSet. This eliminates an internal pass and reduces memory pressure, so it should improve performance noticeably with large sheets.
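A minimal sketch of that configuration, assuming ExcelDataReader 3.x where AsDataSet (from the ExcelDataReader.DataSet package) accepts an ExcelDataSetConfiguration:
using (var reader = ExcelReaderFactory.CreateReader(stream))
{
    // Skip the column data-type detection pass when loading the DataSet.
    var dataSet = reader.AsDataSet(new ExcelDataSetConfiguration
    {
        UseColumnDataType = false
    });
    var dataTable = dataSet.Tables[0];
}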
I am trying to use the DotNetZip open source library for creating large zip files.
I need to be able to write part of each data row's content to each stream writer (see the code below). Another limitation is that I can't do this in memory because the contents are large (several gigabytes per entry).
The problem I have is that despite writing to each stream separately, the output is all written to the last entry only. The first entry is blank. Does anybody have any idea how to fix this issue?
static void Main(string fileName)
{
    var dt = CreateDataTable();
    var streamWriters = new StreamWriter[2];
    using (var zipOutputStream = new ZipOutputStream(File.Create(fileName)))
    {
        for (var i = 0; i < 2; i++)
        {
            var entryName = "file" + i + ".txt";
            zipOutputStream.PutNextEntry(entryName);
            streamWriters[i] = new StreamWriter(zipOutputStream, Encoding.UTF8);
        }
        WriteContents(streamWriters[0], streamWriters[1], dt);
        zipOutputStream.Close();
    }
}

private DataTable CreateDataTable()
{
    var dt = new DataTable();
    dt.Columns.AddRange(new DataColumn[] { new DataColumn("col1"), new DataColumn("col2"), new DataColumn("col3"), new DataColumn("col4") });
    for (int i = 0; i < 100000; i++)
    {
        var row = dt.NewRow();
        for (int j = 0; j < 4; j++)
        {
            row[j] = j * 1;
        }
        dt.Rows.Add(row);
    }
    return dt;
}

private void WriteContents(StreamWriter writer1, StreamWriter writer2, DataTable dt)
{
    foreach (DataRow dataRow in dt.Rows)
    {
        writer1.WriteLine(dataRow[0] + ", " + dataRow[1]);
        writer2.WriteLine(dataRow[2] + ", " + dataRow[3]);
    }
}
Expected results:
Both file0.txt and file1.txt need to be written.
Actual results:
Only file1.txt has all the content written to it; file0.txt is blank.
It seems to be the expected behaviour according to the docs
If you don't call Write() between two calls to PutNextEntry(), the first entry is inserted into the zip file as a file of zero size. This may be what you want.
So to me it seems that it is not possible to do what you want through the current API.
Also, since a zip file is a continuous sequence of zip entries, it is probably physically impossible to create entries in parallel: you would have to know the size of each entry before starting a new one.
Perhaps you could just create separate archives and then combine them (if I am not mistaken, there was a simple API to do that).
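For completeness, here is a rough sketch of the sequential alternative implied by the docs quote above: finish writing one entry before calling PutNextEntry for the next. It reuses fileName and dt from the question, assumes .NET 4.5+ for the leaveOpen StreamWriter constructor, and iterates the DataTable twice; for truly multi-gigabyte content you would need a row source you can re-read.
// requires: using Ionic.Zip; using System.Data; using System.IO; using System.Text;
using (var zipOutputStream = new ZipOutputStream(File.Create(fileName)))
{
    // First entry: write all of its content before starting the next entry.
    zipOutputStream.PutNextEntry("file0.txt");
    using (var writer = new StreamWriter(zipOutputStream, Encoding.UTF8, 8192, leaveOpen: true))
    {
        foreach (DataRow dataRow in dt.Rows)
            writer.WriteLine(dataRow[0] + ", " + dataRow[1]);
    }

    // Second entry: only now is PutNextEntry called again.
    zipOutputStream.PutNextEntry("file1.txt");
    using (var writer = new StreamWriter(zipOutputStream, Encoding.UTF8, 8192, leaveOpen: true))
    {
        foreach (DataRow dataRow in dt.Rows)
            writer.WriteLine(dataRow[2] + ", " + dataRow[3]);
    }
}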
I'm getting a System.OutOfMemoryException while creating a pivot table with NReco ExcelPivotTableWriter.
public void Write(PivotTable pvtTbl)
{
    var tbl = getPivotDataAsTable(pvtTbl.PivotData);
    var rangePivotTable = wsData.Cells["A1"].LoadFromDataTable(tbl, false);
    var pivotTable = ws.PivotTables.Add(
        ws.Cells[1, 1],
        rangePivotTable, "pvtTable");
    foreach (var rowDim in pvtTbl.Rows)
        pivotTable.RowFields.Add(pivotTable.Fields[rowDim]);
    foreach (var colDim in pvtTbl.Columns)
        pivotTable.ColumnFields.Add(pivotTable.Fields[colDim]);
    pivotTable.ColumGrandTotals = false;
    pivotTable.DataOnRows = false;
    pivotTable.ColumGrandTotals = false;
    pivotTable.RowGrandTotals = false;
    if (pvtTbl.PivotData.AggregatorFactory is CompositeAggregatorFactory)
    {
        var aggrFactories = ((CompositeAggregatorFactory)pvtTbl.PivotData.AggregatorFactory).Factories;
        for (int i = 0; i < aggrFactories.Length; i++)
        {
            var dt = pivotTable.DataFields.Add(pivotTable.Fields[String.Format("value_{0}", i)]);
            dt.Function = SuggestFunction(aggrFactories[i]);
            string columnName = "";
            if (dt.Function == OfficeOpenXml.Table.PivotTable.DataFieldFunctions.Sum)
                columnName = ((NReco.PivotData.SumAggregatorFactory)aggrFactories[i]).Field;
            else if (dt.Function == OfficeOpenXml.Table.PivotTable.DataFieldFunctions.Average)
                columnName = ((NReco.PivotData.AverageAggregatorFactory)aggrFactories[i]).Field;
            if (columnNames.ContainsKey(columnName))
                dt.Name = columnNames[columnName].ToString();
            else
                dt.Name = aggrFactories[i].ToString();
        }
    }
    else
    {
        pivotTable.DataFields.Add(pivotTable.Fields["value"]).Function = SuggestFunction(pvtTbl.PivotData.AggregatorFactory);
    }
}
The error occurs while creating rangePivotTable:
var rangePivotTable = wsData.Cells["A1"].LoadFromDataTable(tbl, false);
The LazyTotal mode is true:
var ordersPvtData = new PivotData(dimentionsArray, composite, true);
The dataset has 200k rows, which I don't think is too much. I have 8 GB of RAM on Windows 10.
I am using the free version of NReco.
Any solution?
8 GB may not be enough physical memory, depending on how large each of the 200K rows is and on the memory consumption of the other applications running on your system.
Before you run this program, start the Windows Task Manager and click on the Performance tab.
Note the Available and Free Memory values. Then run your program and watch how the memory is consumed. If your program does consume all of your available memory, then your options are...
Free up more memory by removing other applications that consume memory.
Add more physical memory to your system.
Modify your program to make it more memory efficient. (this includes removal of memory leaks)
Some combination of the prior three options.
You should be able to slice through 200k rows pretty easily. Try it like this . . .
Workbook workbook = new Workbook();
workbook.LoadFromFile(@"C:\your_path_here\SampleFile.xlsx");
Worksheet sheet = workbook.Worksheets[0];
sheet.Name = "Data Source";
Worksheet sheet2 = workbook.CreateEmptySheet();
sheet2.Name = "Pivot Table";
CellRange dataRange = sheet.Range["A1:G200000"];
PivotCache cache = workbook.PivotCaches.Add(dataRange);
PivotTable pt = sheet2.PivotTables.Add("Pivot Table", sheet.Range["A1"], cache);
var r1 = pt.PivotFields["Vendor No"];
r1.Axis = AxisTypes.Row;
pt.Options.RowHeaderCaption = "Vendor No";
var r2 = pt.PivotFields["Description"];
r2.Axis = AxisTypes.Row;
pt.DataFields.Add(pt.PivotFields["OnHand"], "SUM of OnHand", SubtotalTypes.Sum);
pt.DataFields.Add(pt.PivotFields["OnOrder"], "SUM of OnOrder", SubtotalTypes.Sum);
pt.DataFields.Add(pt.PivotFields["ListPrice"], "Average of ListPrice", SubtotalTypes.Average);
pt.BuiltInStyle = PivotBuiltInStyles.PivotStyleMedium12;
workbook.SaveToFile("PivotTable.xlsx", ExcelVersion.Version2010);
System.Diagnostics.Process.Start("PivotTable.xlsx");
I have 300 CSV files; each file contains 18000 rows and 27 columns.
Now I want to make a Windows Forms application which imports them, shows them in a DataGridView, and does some mathematical operations later.
But the performance is very poor...
After search this problem by google, I found a solution "A Fast CSV Reader".
(http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader)
I followed the code step by step, but my DataGridView is still empty.
I don't know how to solve this problem.
Could anyone tell me what I am doing wrong, or suggest a better way to read CSV files efficiently?
Here is my code...
using System.IO;
using LumenWorks.Framework.IO.Csv;
private void Form1_Load(object sender, EventArgs e)
{
    ReadCsv();
}

void ReadCsv()
{
    // open the file "data.csv" which is a CSV file with headers
    using (CachedCsvReader csv = new CachedCsvReader(new StreamReader("data.csv"), true))
    {
        // Field headers will automatically be used as column names
        dataGridView1.DataSource = csv;
    }
}
Here is my input data:
https://dl.dropboxusercontent.com/u/28540219/20130102.csv
Thanks...
The data you provided contains no headers (the first line is a data line), so I got an ArgumentException (item with same key added) when I tried to set the csv reader as the DataSource. Setting the hasHeaders parameter in the CachedCsvReader constructor to false did the trick, and it added the data to the DataGridView (very fast).
using (CachedCsvReader csv = new CachedCsvReader(new StreamReader("data.csv"), false))
{
    dataGridView.DataSource = csv;
}
Hope this helps!
You can also do it like this:
private void ReadCsv()
{
    string filePath = @"C:\..\20130102.csv";
    FileStream fileStream = null;
    try
    {
        fileStream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
    }
    catch (Exception ex)
    {
        return;
    }

    DataTable table = new DataTable();
    bool isColumnCreated = false;
    using (StringReader reader = new StringReader(new StreamReader(fileStream, Encoding.Default).ReadToEnd()))
    {
        while (reader.Peek() != -1)
        {
            string line = reader.ReadLine();
            if (line == null || line.Length == 0)
                continue;

            string[] values = line.Split(',');
            if (!isColumnCreated)
            {
                for (int i = 0; i < values.Length; i++)
                {
                    table.Columns.Add("Column" + i);
                }
                isColumnCreated = true;
            }

            DataRow row = table.NewRow();
            for (int i = 0; i < values.Length; i++)
            {
                row[i] = values[i];
            }
            table.Rows.Add(row);
        }
    }
    dataGridView1.DataSource = table;
}
Based on your performance requirements, this code can be improved; it is just a working sample for your reference. I hope it gives you some idea.
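As one hedged example of such an improvement, the whole-file ReadToEnd() buffer could be replaced by reading line by line straight from the StreamReader, which keeps memory usage roughly constant. This sketch reuses filePath, Encoding.Default, and dataGridView1 from the sample above and assumes comma-only fields with no quoted values:
DataTable table = new DataTable();
using (var reader = new StreamReader(filePath, Encoding.Default))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue;

        string[] values = line.Split(',');

        // Create the columns from the first non-empty line.
        if (table.Columns.Count == 0)
        {
            for (int i = 0; i < values.Length; i++)
                table.Columns.Add("Column" + i);
        }

        // DataRowCollection.Add accepts an array of values directly.
        table.Rows.Add(values);
    }
}
dataGridView1.DataSource = table;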
I tried to split a file of about 32 GB using the code below, but I got an out-of-memory exception.
Please suggest how to split the file using C#.
string[] splitFile = File.ReadAllLines(@"E:\\JKS\\ImportGenius\\0.txt");
int cycle = 1;
int splitSize = Convert.ToInt32(txtNoOfLines.Text);
var chunk = splitFile.Take(splitSize);
var rem = splitFile.Skip(splitSize);

while (chunk.Take(1).Count() > 0)
{
    string filename = "file" + cycle.ToString() + ".txt";
    using (StreamWriter sw = new StreamWriter(filename))
    {
        foreach (string line in chunk)
        {
            sw.WriteLine(line);
        }
    }
    chunk = rem.Take(splitSize);
    rem = rem.Skip(splitSize);
    cycle++;
}
Well, to start with, you need to use File.ReadLines (assuming you're using .NET 4) so that it doesn't try to read the whole thing into memory. Then I'd just keep calling a method to write the "next" however many lines to a new file:
int splitSize = Convert.ToInt32(txtNoOfLines.Text);

using (var lineIterator = File.ReadLines(...).GetEnumerator())
{
    bool stillGoing = true;
    for (int chunk = 0; stillGoing; chunk++)
    {
        stillGoing = WriteChunk(lineIterator, splitSize, chunk);
    }
}

...

private static bool WriteChunk(IEnumerator<string> lineIterator,
                               int splitSize, int chunk)
{
    using (var writer = File.CreateText("file " + chunk + ".txt"))
    {
        for (int i = 0; i < splitSize; i++)
        {
            if (!lineIterator.MoveNext())
            {
                return false;
            }
            writer.WriteLine(lineIterator.Current);
        }
    }
    return true;
}
Do not read all the lines into an array at once; use the StreamReader.ReadLine method instead, like:
using (StreamReader sr = new StreamReader(@"E:\\JKS\\ImportGenius\\0.txt"))
{
    while (sr.Peek() >= 0)
    {
        var fileLine = sr.ReadLine();
        // do something with the line
    }
}
File.ReadAllLines
That will read the whole file into memory.
To work with large files you need to only read what you need now into memory, and then throw that away as soon as you have finished with it.
A better option would be File.ReadLines, which returns a lazy enumerator; data is only read into memory as you get the next line from the enumerator. Provided you avoid multiple enumerations (e.g. don't use Count()), only part of the file is read at a time.
Instead of reading all the file at once using File.ReadAllLines, use File.ReadLines in a foreach loop to read the lines as needed.
foreach (var line in File.ReadLines(@"E:\\JKS\\ImportGenius\\0.txt"))
{
    // Do something
}
Edit: On an unrelated note, you don't have to escape your backslashes when prefixing the string with an '@'. So either write "E:\\JKS\\ImportGenius\\0.txt" or @"E:\JKS\ImportGenius\0.txt", but @"E:\\JKS\\ImportGenius\\0.txt" is redundant.
The problem here is that you are reading the entire file's content into memory at once with File.ReadAllLines(). What you need to do is open a FileStream with File.OpenRead() and read/write smaller chunks.
Edit: Actually for your case ReadLine is obviously better. See other answers. :)
Use a StreamReader to read the file and a StreamWriter to write the output files.
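A minimal sketch of that idea, assuming a fixed number of lines per output file (the chunk size and file name pattern are just illustrative); only one line is held in memory at a time:
int splitSize = 100000;   // assumed lines per output file
int cycle = 1;
using (var reader = new StreamReader(@"E:\JKS\ImportGenius\0.txt"))
{
    string line = reader.ReadLine();
    while (line != null)
    {
        // Write up to splitSize lines into the current chunk file.
        using (var writer = new StreamWriter("file" + cycle + ".txt"))
        {
            for (int i = 0; i < splitSize && line != null; i++)
            {
                writer.WriteLine(line);
                line = reader.ReadLine();
            }
        }
        cycle++;
    }
}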
I have the code below to generate a CSV file from a DataGridView, and it is working fine. But it is not preserving the exact formats. For example, I have 125600.00 in one cell and 08 in another cell in the DataGridView. When I open the CSV file using Excel, it shows them as 125600 and 8, and when I open the CSV file using Notepad, it shows them correctly as 125600.00 and 08. Is there something I can do to see the same formats in Excel?
Thanks for any suggestions.
private void btnPrint_Click(object sender, EventArgs e)
{
    sbMainFile = new StringBuilder();
    int dgcolcount = this.dataGridView1.Columns.Count;
    for (int i = 0; i < dgcolcount; i++)
    {
        sbMainFile.Append(this.dataGridView1.Columns[i].Name);
        if (i < dgcolcount - 1)
        {
            sbMainFile.Append(",");
        }
    }
    sbMainFile.Append("\r\n");

    StringBuilder sbRow = null;
    foreach (DataGridViewRow row in this.dataGridView1.Rows)
    {
        sbRow = new StringBuilder();
        for (int i = 0; i < dataGridView1.Columns.Count; i++)
        {
            sbRow.Append(row.Cells[i].Value.ToString());
            if (i < dgcolcount - 1)
            {
                sbRow.Append(",");
            }
        }
        sbMainFile.AppendLine(sbRow.ToString());
    }

    SaveFileDialog savefile = new SaveFileDialog();
    savefile.FileName = "default";
    savefile.Filter = "CSV Files | *.csv";
    savefile.DefaultExt = "csv";
    if (savefile.ShowDialog() == DialogResult.OK)
    {
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter(savefile.FileName))
            sw.Write(sbMainFile.ToString());
    }
}
You could save the "column values" as a string; e.g. 08 as ="08"
change
sbRow.Append(row.Cells[i].Value.ToString());
to
sbRow.AppendFormat("=\"{0}\"", row.Cells[i].Value.ToString());
There is nothing wrong with your code. That is actually a problem with Excel formatting your data for you. You don't actually lose any data. You simply have to specify the format in Excel. (If you click on one of your cells where the format is mangled you can actually see the "raw" value in the f(x) editor.)