C# Excel Interop Slow when looping through cells - c#

I am trying to extract all text data from an Excel document in C# and am having performance issues. In the following code I open the Workbook, loop over all worksheets, and loop over all cells in the used range, extracting the text from each cell as I go. The problem is, this takes 14 seconds to execute.
public class ExcelFile
{
public string Path = @"C:\test.xlsx";
private Excel.Application xl = new Excel.Application();
private Excel.Workbook WB;
public string FullText;
private Excel.Range rng;
private Dictionary<string, string> Variables;
public ExcelFile()
{
WB = xl.Workbooks.Open(Path);
xl.Visible = true;
foreach (Excel.Worksheet CurrentWS in WB.Worksheets)
{
rng = CurrentWS.UsedRange;
for (int i = 1; i < rng.Count; i++)
{ FullText += rng.Cells[i].Value; }
}
WB.Close(false);
xl.Quit();
}
}
Whereas in VBA I would do something like this, which takes ~1 second:
Sub run()
Dim strText As String
For Each ws In ActiveWorkbook.Sheets
For Each c In ws.UsedRange
strText = strText & c.Text
Next c
Next ws
End Sub
Or, even faster (less than 1 second):
Sub RunFast()
Dim strText As String
Dim varCells As Variant
For Each ws In ActiveWorkbook.Sheets
varCells = ws.UsedRange
For i = 1 To UBound(varCells, 1)
For j = 1 To UBound(varCells, 2)
strText = strText & CStr(varCells(i, j))
Next j
Next i
Next ws
End Sub
Perhaps something is happening in the for loop in C# that I'm not aware of? Is it possible to load a range into an array-type object (as in my last example) to allow iteration over just the values, not the cell objects?

Excel and C# run in different environments completely. C# runs in the .NET framework using managed memory while Excel is a native C++ application and runs in unmanaged memory. Translating data between these two (a process called "marshaling") is extremely expensive in terms of performance.
Tweaking your code isn't going to help. For loops, string construction, etc. are all blazingly fast compared to the marshaling process. The only way you are going to get significantly better performance is to reduce the number of trips that have to cross the interprocess boundary. Extracting data cell by cell is never going to get you the performance you want.
Here are a couple options:
Write a sub or function in VBA that does everything you want, then call that sub or function via interop.
Use interop to save each worksheet to a temporary file in CSV format, then open the file from C#. You will need to loop through and parse the file to get it into a useful data structure, but that loop will run much faster (a rough sketch follows this list).
Use interop to save a range of cells to the clipboard, then use C# to read the clipboard directly.
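For example, here is a minimal, untested sketch of the CSV option. It assumes xl and WB are the open Application and Workbook from the question, that System.IO and System.Text are imported, and that the naive comma split is only a placeholder (quoted fields containing commas would need a real CSV parser):
var text = new StringBuilder();
foreach (Excel.Worksheet ws in WB.Worksheets)
{
    // One interop call per sheet instead of one per cell.
    string tempCsv = System.IO.Path.Combine(System.IO.Path.GetTempPath(), ws.Name + ".csv");
    ws.SaveAs(tempCsv, Excel.XlFileFormat.xlCSV);
    // Parse the CSV locally; this loop never crosses the process boundary.
    foreach (string line in File.ReadLines(tempCsv))
        foreach (string field in line.Split(','))
            text.Append(field);
    File.Delete(tempCsv);
}
string fullText = text.ToString();
// Close without saving afterwards (as in the question), since SaveAs changed the file name/format.
WB.Close(false);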

I use this function. The loops only exist to convert to an array starting at index 0; the main work is done in object[,] tmp = range.Value.
public object[,] GetTable(int row, int col, int width, int height)
{
object[,] arr = new object[height, width];
Range c1 = (Range)Worksheet.Cells[row + 1, col + 1];
Range c2 = (Range)Worksheet.Cells[row + height, col + width];
Range range = Worksheet.get_Range(c1, c2);
object[,] tmp = range.Value;
for (int i = 0; i < height; ++i)
{
for (int j = 0; j < width; ++j)
{
arr[i, j] = tmp[i + tmp.GetLowerBound(0), j + tmp.GetLowerBound(1)];
}
}
return arr;
}
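A quick usage note, assuming Worksheet is a field of the containing class that already points at the sheet you want (which is what the function above relies on):
// Grab a 500-row by 12-column block starting at the top-left cell (inputs are 0-based).
object[,] block = GetTable(0, 0, 12, 500);
// block[0, 0] now holds the value of A1 (or null for an empty cell).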

One thing which will speed it up is to use a StringBuilder instead of += on the string. Strings are immutable in C#, so you are creating a ton of extra intermediate strings while building the final string.
Additionally, you may improve performance by looping over row and column positions instead of looping over a single flat cell index.
Here is the code changed with a StringBuilder and row, column positional looping:
public class ExcelFile
{
public string Path = @"C:\test.xlsx";
private Excel.Application xl = new Excel.Application();
private Excel.Workbook WB;
public string FullText;
private Excel.Range rng;
private Dictionary<string, string> Variables;
public ExcelFile()
{
StringBuilder sb = new StringBuilder();
WB = xl.Workbooks.Open(Path);
xl.Visible = true;
foreach (Excel.Worksheet CurrentWS in WB.Worksheets)
{
rng = CurrentWS.UsedRange;
for (int i = 1; i <= rng.Rows.Count; i++)
{
for (int j = 1; j <= rng.Columns.Count; j++)
{
sb.Append(rng.Cells[i, j].Value);
}
}
}
FullText = sb.ToString();
WB.Close(false);
xl.Quit();
}
}

I sympathize with you, pwwolff. Looping through Excel cells can be expensive. Antonio and Max are both correct, but John Wu's answer sums it up nicely. Using a StringBuilder may speed things up, and making an object array from the used range is, IMHO, about as fast as you are going to get using interop. I understand there are third-party libraries that may perform better. With interop, looping through each cell will take an unacceptable amount of time if the file is large.
On the tests below I used a workbook with a single sheet where the sheet has 11 columns and 100 rows of used range data. Using an object array implementation this took a little over a second. With 735 rows it took around 40 seconds.
I put 3 buttons on a form with a multi-line text box. The first button uses your posted code. The second button takes the ranges out of the loops. The third button uses an object-array approach. Each one shows a significant performance improvement over the previous one. I used a text box on the form to output the data; you can use a string as you are, but a StringBuilder would be better if you must end up with one big string.
Again, if the files are large you may want to consider another implementation. Hope this helps.
private void button1_Click(object sender, EventArgs e) {
Stopwatch sw = new Stopwatch();
MessageBox.Show("Start DoExcel...");
sw.Start();
DoExcel();
sw.Stop();
MessageBox.Show("End DoExcel...Took: " + sw.Elapsed.Seconds + " seconds and " + sw.Elapsed.Milliseconds + " Milliseconds");
}
private void button2_Click(object sender, EventArgs e) {
MessageBox.Show("Start DoExcel2...");
Stopwatch sw = new Stopwatch();
sw.Start();
DoExcel2();
sw.Stop();
MessageBox.Show("End DoExcel2...Took: " + sw.Elapsed.Seconds + " seconds and " + sw.Elapsed.Milliseconds + " Milliseconds");
}
private void button3_Click(object sender, EventArgs e) {
MessageBox.Show("Start DoExcel3...");
Stopwatch sw = new Stopwatch();
sw.Start();
DoExcel3();
sw.Stop();
MessageBox.Show("End DoExcel3...Took: " + sw.Elapsed.Seconds + " seconds and " + sw.Elapsed.Milliseconds + " Milliseconds");
}
// object[,] array implementation
private void DoExcel3() {
textBox1.Text = "";
string Path = @"D:\Test\Book1 - Copy.xls";
Excel.Application xl = new Excel.Application();
Excel.Workbook WB;
Excel.Range rng;
WB = xl.Workbooks.Open(Path);
xl.Visible = true;
int totalRows = 0;
int totalCols = 0;
foreach (Excel.Worksheet CurrentWS in WB.Worksheets) {
rng = CurrentWS.UsedRange;
totalCols = rng.Columns.Count;
totalRows = rng.Rows.Count;
object[,] objectArray = (object[,])rng.Cells.Value;
for (int row = 1; row <= totalRows; row++) {
for (int col = 1; col <= totalCols; col++) {
if (objectArray[row, col] != null)
textBox1.Text += objectArray[row,col].ToString();
}
textBox1.Text += Environment.NewLine;
}
}
WB.Close(false);
xl.Quit();
Marshal.ReleaseComObject(WB);
Marshal.ReleaseComObject(xl);
}
// Range taken out of loops
private void DoExcel2() {
textBox1.Text = "";
string Path = @"D:\Test\Book1 - Copy.xls";
Excel.Application xl = new Excel.Application();
Excel.Workbook WB;
Excel.Range rng;
WB = xl.Workbooks.Open(Path);
xl.Visible = true;
int totalRows = 0;
int totalCols = 0;
foreach (Excel.Worksheet CurrentWS in WB.Worksheets) {
rng = CurrentWS.UsedRange;
totalCols = rng.Columns.Count;
totalRows = rng.Rows.Count;
for (int row = 1; row <= totalRows; row++) {
for (int col = 1; col <= totalCols; col++) {
textBox1.Text += rng.Rows[row].Cells[col].Value;
}
textBox1.Text += Environment.NewLine;
}
}
WB.Close(false);
xl.Quit();
Marshal.ReleaseComObject(WB);
Marshal.ReleaseComObject(xl);
}
// original posted code
private void DoExcel() {
textBox1.Text = "";
string Path = @"D:\Test\Book1 - Copy.xls";
Excel.Application xl = new Excel.Application();
Excel.Workbook WB;
Excel.Range rng;
WB = xl.Workbooks.Open(Path);
xl.Visible = true;
foreach (Excel.Worksheet CurrentWS in WB.Worksheets) {
rng = CurrentWS.UsedRange;
for (int i = 1; i < rng.Count; i++) {
textBox1.Text += rng.Cells[i].Value;
}
}
WB.Close(false);
xl.Quit();
Marshal.ReleaseComObject(WB);
Marshal.ReleaseComObject(xl);
}
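Combining the object-array read with the StringBuilder suggestion, a minimal sketch of the original constructor body might look like the following. It is untested, reuses the xl, WB, Path and FullText members from the question, and glosses over releasing the intermediate COM objects:
StringBuilder sb = new StringBuilder();
WB = xl.Workbooks.Open(Path);
foreach (Excel.Worksheet ws in WB.Worksheets)
{
    Excel.Range used = ws.UsedRange;
    object value = used.Value;                // one marshaled read per sheet
    object[,] cells = value as object[,];     // a multi-cell range comes back as a 1-based 2D array
    if (cells != null)
    {
        for (int r = cells.GetLowerBound(0); r <= cells.GetUpperBound(0); r++)
            for (int c = cells.GetLowerBound(1); c <= cells.GetUpperBound(1); c++)
                if (cells[r, c] != null)
                    sb.Append(cells[r, c]);
    }
    else if (value != null)                   // a single-cell used range returns a scalar instead
    {
        sb.Append(value);
    }
}
FullText = sb.ToString();
WB.Close(false);
xl.Quit();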

Related

SSIS Export to Excel using Script Task

I'm trying to use a Script Task to export data to Excel because some of the reports I generate simply have too many columns to keep using a template file.
The most annoying part about using a template is that if something as simple as a column header changes, the metadata gets screwed up, forcing me to recreate my Data Flow. Because I use an OLE DB source, I need a Data Conversion transformation to convert between Unicode and non-Unicode character sets, then remap my Excel Destination to the "Copy of field x" columns in order for the Excel document to be created properly.
This takes far too long and I need a new approach.
I have the following method in a script task using Excel = Microsoft.Office.Interop.Excel:
private void ExportToExcel(DataTable dataTable, string excelFilePath = null)
{
Excel.Application excelApp = new Excel.Application();
Excel.Worksheet workSheet = null;
try
{
if (dataTable == null || dataTable.Columns.Count == 0)
throw new System.Exception("Null or empty input table!" + Environment.NewLine);
excelApp.Workbooks.Add();
workSheet = excelApp.ActiveSheet;
for (int i = 0; i < dataTable.Columns.Count; i++)
{
workSheet.Cells[1, (i + 1)] = dataTable.Columns[i].ColumnName;
}
foreach (DataTable dt in dataSet.Tables)
{
// Copy the DataTable to an object array
object[,] rawData = new object[dt.Rows.Count + 1, dt.Columns.Count];
// Copy the column names to the first row of the object array
for (int col = 0; col < dt.Columns.Count; col++)
{
rawData[0, col] = dt.Columns[col].ColumnName;
}
// Copy the values to the object array
for (int col = 0; col < dt.Columns.Count; col++)
{
for (int row = 0; row < dt.Rows.Count; row++)
{
rawData[row + 1, col] = dt.Rows[row].ItemArray[col];
}
}
// Calculate the final column letter
string finalColLetter = string.Empty;
string colCharset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int colCharsetLen = colCharset.Length;
if (dt.Columns.Count > colCharsetLen)
{
finalColLetter = colCharset.Substring((dt.Columns.Count - 1) / colCharsetLen - 1, 1);
}
finalColLetter += colCharset.Substring((dt.Columns.Count - 1) % colCharsetLen, 1);
workSheet.Name = dt.TableName;
// Fast data export to Excel
string excelRange = string.Format("A1:{0}{1}", finalColLetter, dt.Rows.Count + 1);
//The code crashes here (ONLY in SSIS):
workSheet.get_Range(excelRange, Type.Missing).Value2 = rawData;
// Mark the first row as BOLD
((Excel.Range)workSheet.Rows[1, Type.Missing]).Font.Bold = true;
}
List<int> lstColumnsToSum = new List<int>() { 9 };
Dictionary<int, string> dictColSumName = new Dictionary<int, string>() { { 9, "" } };
Dictionary<int, decimal> dictColumnSummation = new Dictionary<int, decimal>() { { 9, 0 } };
// rows
for (int i = 0; i < dataTable.Rows.Count; i++)
{
for (int j = 1; j <= dataTable.Columns.Count; j++)
{
workSheet.Cells[(i + 2), (j)] = dataTable.Rows[i][j - 1];
if (lstColumnsToSum.Exists(x => (x == j)))
{
decimal val = 0;
if (decimal.TryParse(dataTable.Rows[i][j - 1].ToString(), out val))
{
dictColumnSummation[j] += val;
}
}
}
}
//Footer
int footerRowIdx = 2 + dataTable.Rows.Count;
foreach (var summablecolumn in dictColSumName)
{
workSheet.Cells[footerRowIdx, summablecolumn.Key] = String.Format("{0}", dictColumnSummation[summablecolumn.Key]);
}
// check file path
if (excelFilePath != null && excelFilePath != "")
{
try
{
if (File.Exists(excelFilePath))
File.Delete(excelFilePath);
workSheet.Activate();
workSheet.Application.ActiveWindow.SplitRow = 1;
workSheet.Application.ActiveWindow.FreezePanes = true;
int row = 1;
int column = 1;
foreach (var item in dataTable.Columns)
{
Excel.Range range = workSheet.Cells[row, column] as Excel.Range;
range.NumberFormat = "#";
range.EntireColumn.AutoFit();
range.Interior.Color = System.Drawing.ColorTranslator.ToOle(System.Drawing.Color.LightGray);
column++;
}
Excel.Range InternalCalculatedAmount = workSheet.Cells[1, 9] as Excel.Range;
InternalCalculatedAmount.EntireColumn.NumberFormat = "#0.00";
InternalCalculatedAmount.Columns.AutoFit();
workSheet.SaveAs(excelFilePath);
}
catch (System.Exception ex)
{
throw new System.Exception("Excel file could not be saved! Check filepath." + Environment.NewLine + ex.Message);
}
}
else // no filepath is given
{
excelApp.Visible = true;
}
}
catch (System.Exception ex)
{
throw new System.Exception(ex.Message + Environment.NewLine, ex.InnerException);
}
}
The exception thrown is a System.OutOfMemoryException when trying to execute the following piece of code:
workSheet.get_Range(excelRange, Type.Missing).Value2 = rawData;
My biggest frustration is that this method works 100% in a regular C# application.
The DataTable contains about 435,000 rows. I know it's quite a bit of data, but I use this very method (modified, of course) to split data across multiple Excel worksheets in one of my other applications, and that DataSet contains about 1.1 million rows. So less than half of my largest DataSet should be a walk in the park...
Any light shed on this matter would be amazing!
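One hedged idea worth testing, offered only as a sketch since it does not explain why the SSIS host behaves differently, is to push the array to the sheet in row chunks rather than as a single 435,000-row Value2 assignment, so each marshaled block stays comparatively small. WriteInChunks below is a hypothetical helper, not part of the posted code:
// Write rawData (a 0-based object[,] that already includes the header row)
// to the sheet in blocks of chunkRows rows instead of one giant assignment.
private static void WriteInChunks(Excel.Worksheet sheet, object[,] rawData, int chunkRows)
{
    int totalRows = rawData.GetLength(0);
    int totalCols = rawData.GetLength(1);
    for (int start = 0; start < totalRows; start += chunkRows)
    {
        int rows = Math.Min(chunkRows, totalRows - start);
        object[,] block = new object[rows, totalCols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < totalCols; c++)
                block[r, c] = rawData[start + r, c];
        Excel.Range first = (Excel.Range)sheet.Cells[start + 1, 1];
        Excel.Range last = (Excel.Range)sheet.Cells[start + rows, totalCols];
        sheet.get_Range(first, last).Value2 = block;    // one interop call per chunk
    }
}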

How to write on multiple worksheets using EPPlus

I'm using the following code snippet to write some data into an Excel file using EPPlus. My application does some big data processing, and since Excel has a limit of roughly 1 million rows per sheet, space runs out from time to time. So what I am trying to achieve is this: once a System.ArgumentException : row out of range is detected, or in other words no space is left in the worksheet, the remainder of the data should be written to a 2nd worksheet in the same workbook. I have tried the following code but with no success yet. Any help will be appreciated!
try
{
for (int i = 0; i < data.Count(); i++)
{
var cell1 = ws.Cells[rowIndex, colIndex];
cell1.Value = data[i];
colIndex++;
}
rowIndex++;
}
catch (System.ArgumentException)
{
for (int i = 0; i < data.Count(); i++)
{
var cell2 = ws1.Cells[rowIndex, colIndex];
cell2.Value = data[i];
colIndex++;
}
rowIndex++;
}
You shouldn't use a catch block to handle that kind of logic; exceptions are more of a last resort. Better to engineer your code to deal with the situation up front, since it is entirely predictable.
The Excel 2007 format has a hard limit of 1,048,576 rows per sheet. With that, you know exactly how many rows to write before moving on to a new sheet. From there it is simple for loops and math:
[TestMethod]
public void Big_Row_Count_Test()
{
var existingFile = new FileInfo(@"c:\temp\temp.xlsx");
if (existingFile.Exists)
existingFile.Delete();
const int maxExcelRows = 1048576;
using (var package = new ExcelPackage(existingFile))
{
//Assume a data row count
var rowCount = 2000000;
//Determine number of sheets
var sheetCount = (int)Math.Ceiling((double)rowCount/ maxExcelRows);
for (var i = 0; i < sheetCount; i++)
{
var ws = package.Workbook.Worksheets.Add(String.Format("Sheet{0}", i));
var sheetRowLimit = Math.Min((i + 1)*maxExcelRows, rowCount);
//Remember +1 for 1-based excel index
for (var j = i * maxExcelRows + 1; j <= sheetRowLimit; j++)
{
var cell1 = ws.Cells[j - (i*maxExcelRows), 1];
cell1.Value = j;
}
}
package.Save();
}
}
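If the real data lives in a DataTable, the same sheet-splitting arithmetic can be paired with EPPlus's LoadFromDataTable instead of writing cell by cell. Here is a rough sketch, where bigTable is a made-up DataTable and package is the ExcelPackage from a using block like the one above:
// Leave row 1 of each sheet for the column headers.
const int maxDataRows = 1048575;
var sheetCount = (int)Math.Ceiling((double)bigTable.Rows.Count / maxDataRows);
for (var i = 0; i < sheetCount; i++)
{
    // Copy this sheet's slice of rows into a table with the same schema.
    var chunk = bigTable.Clone();
    var upper = Math.Min((i + 1) * maxDataRows, bigTable.Rows.Count);
    for (var r = i * maxDataRows; r < upper; r++)
        chunk.ImportRow(bigTable.Rows[r]);
    var ws = package.Workbook.Worksheets.Add(string.Format("Sheet{0}", i));
    ws.Cells["A1"].LoadFromDataTable(chunk, true);      // true = write the column headers
}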

How to iterate through Excel Worksheets only extracting data from specific columns

How do you iterate through an Excel workbook with multiple worksheets, only extracting data from, say, columns "C", "E" and "F"?
Here is the code I have thus far:
public static string ExtractData(string filePath)
{
Excel.Application excelApp = new Excel.Application();
Excel.Workbook workBook = excelApp.Workbooks.Open(filePath);
string data = string.Empty;
int i = 0;
foreach (Excel.Worksheet sheet in workBook.Worksheets)
{
data += "******* Sheet " + i++.ToString() + " ********\n";
//foreach (Excel.Range row in sheet.UsedRange.Rows)
//{
// data += row.Range["C"].Value.ToString();
//}
foreach (Excel.Range row in sheet.UsedRange.Rows)
{
foreach (Excel.Range cell in row.Columns)
{
data += cell.Value + " ";
}
data += "\n";
}
}
excelApp.Quit();
return data;
}
Thank you very much for your time, any help is appreciated.
Editing your method, here's something that should do what you're looking for:
public static string ExtractData(string filePath)
{
Excel.Application excelApp = new Excel.Application();
Excel.Workbook workBook = excelApp.Workbooks.Open(filePath);
int[] Cols = { 3, 5, 6 }; //Columns to loop
//C, E, F
string data = string.Empty;
int i = 0;
foreach (Excel.Worksheet sheet in workBook.Worksheets)
{
data += "******* Sheet " + i++.ToString() + " ********\n";
foreach (Excel.Range row in sheet.UsedRange.Rows)
{
foreach (int c in Cols) //changed here to loop through columns
{
data += sheet.Cells[row.Row, c].Value2.ToString() + " ";
}
data += "\n";
}
}
excelApp.Quit();
return data;
}
I've created an int array to indicate which columns you'd like to read from, and then on each row we just loop through that array.
HTH,
Z
You could use something like this to get Column C for example:
var numberOfRows = sheet.UsedRange.Columns[3, Type.Missing].Rows.Count;
var values = sheet.Range["C1:C" + numberOfRows].Value2;
numberOfRows holds the number of rows in the worksheet (I don't think it skips blank rows at the top, though I'm not sure). After that you select a range from C1 to CN and read Value2, which contains the values. Note that the values array is actually a two-dimensional, 1-based array. You can now easily do a for loop to get the items:
for (int i = 1; i <= values.Length; i++){
sb.Append(values[i, 1] + " ");
}
This could be optimized further when the columns sit next to each other, but the above code should get you started (see the sketch below).
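Since C, E and F all fall inside one contiguous block, a possible refinement, sketched here reusing sheet, numberOfRows and the sb StringBuilder from the snippet above, is to read C:F in a single call and pick the wanted columns out of the returned array:
// One Value2 read for C..F; the result is a 1-based object[,] with 4 columns.
var block = (object[,])sheet.Range["C1:F" + numberOfRows].Value2;
int[] wanted = { 1, 3, 4 };                  // offsets of C, E and F within the C..F block
for (int r = 1; r <= block.GetLength(0); r++)
{
    foreach (int c in wanted)
        sb.Append(block[r, c] + " ");
    sb.AppendLine();
}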

Fastest way to drop a DataSet into a worksheet

A rather large dataset with 16000 x 12 entries needs to be dumped into a worksheet.
I use the following function now:
for (int r = 0; r < dt.Rows.Count; ++r)
{
for (int c = 0; c < dt.Columns.Count; ++c)
{
worksheet.Cells[c + 1][r + 1] = dt.Rows[r][c].ToString();
}
}
I reduced the example to the essential piece.
Here is what I implemented after reading the suggestion from Dave Zych; it works great.
private static void AppendWorkSheet(Excel.Workbook workbook, DataSet data, String tableName)
{
Excel.Worksheet worksheet;
if (UsedSheets == 0) worksheet = workbook.Worksheets[1];
else worksheet = workbook.Worksheets.Add();
UsedSheets++;
DataTable dt = data.Tables[0];
var valuesArray = new object[dt.Rows.Count, dt.Columns.Count];
for (int r = 0; r < dt.Rows.Count; ++r)
{
for (int c = 0; c < dt.Columns.Count; ++c)
{
valuesArray[r, c] = dt.Rows[r][c].ToString();
}
}
Excel.Range c1 = (Excel.Range)worksheet.Cells[1, 1];
Excel.Range c2 = (Excel.Range)worksheet.Cells[dt.Rows.Count, dt.Columns.Count];
Excel.Range range = worksheet.get_Range(c1, c2);
range.Cells.Value2 = valuesArray;
worksheet.Name = tableName;
}
Build a 2D array of your values from your DataSet, and then you can set a range of values in Excel to the values of the array.
object[,] valuesArray = new object[dt.Rows.Count, dt.Columns.Count];
for(int i = 0; i < dt.Rows.Count; i++)
{
//If you know the number of columns you have, you can specify them this way
//Otherwise use an inner for loop on columns
valuesArray[i, 0] = dt.Rows[i]["ColumnName"].ToString();
valuesArray[i, 1] = dt.Rows[i]["ColumnName2"].ToString();
...
}
//Calculate the second column value by the number of columns in your dataset
//"O" is just an example in this case
//Also note: Excel is 1 based index
var sheetRange = worksheet.get_Range("A2:O2",
string.Format("A{0}:O{0}", dt.Rows.Count + 1));
sheetRange.Cells.Value2 = valuesArray;
This is much, much faster than looping and setting each cell individually. If you're setting each cell individually, you have to talk to Excel through COM (for lack of a better phrase) for each cell (which in your case is ~192,000 times), which is incredibly slow. Looping, building your array and only talking to Excel once removes much of that overhead.

c# Microsoft.Office.Interop.Excel export

I'm writing a program in C# using a DataSet, etc. I have about 200,000 values that I want to export to an .xlsx document.
My code:
using Excel = Microsoft.Office.Interop.Excel;
...
Excel.Application excelApp = new Excel.Application();
Excel.Workbook excelworkbook = excelApp.Workbooks.Open(/location/);
Excel._Worksheet excelworkSheet = (Excel.Worksheet)excelApp.ActiveSheet;
...
excelApp.Visible = true;
...
for (int i = 0; i < /value/; i++)
for (int j = 0; j < /value/; j++)
excelworkSheet.Cells[i, j] = /value/;
It works well, but it is too slow (at least 5-10 minutes).
Have you got any advice?
I just took the same performance hit and wrote this benchmark:
[Test]
public void WriteSpeedTest()
{
var excelApp = new Application();
var workbook = excelApp.Workbooks.Add();
var sheet = (Worksheet)workbook.Worksheets[1];
int n = 1000;
var stopwatch = Stopwatch.StartNew();
SeparateWrites(sheet, n);
Console.WriteLine("SeparateWrites(sheet, " + n + "); took: " + stopwatch.ElapsedMilliseconds + " ms");
stopwatch.Restart();
BatchWrite(sheet, n);
Console.WriteLine("BatchWrite(sheet, " + n + "); took: " + stopwatch.ElapsedMilliseconds + " ms");
workbook.SaveAs(Path.Combine(@"C:\TEMP", "Test"));
workbook.Close(false);
Marshal.FinalReleaseComObject(excelApp);
}
private static void BatchWrite(Worksheet sheet, int n)
{
string[,] strings = new string[n, 1];
var array = Enumerable.Range(1, n).ToArray();
for (var index = 0; index < array.Length; index++)
{
strings[index, 0] = array[index].ToString();
}
sheet.Range["B1", "B" + n].set_Value(null, strings);
}
private static void SeparateWrites(Worksheet sheet, int n)
{
for (int i = 1; i <= n; i++)
{
sheet.Cells[i, 1].Value = i.ToString();
}
}
Results:
                           n = 100    n = 1 000    n = 10 000
SeparateWrites(sheet, n);  180 ms     1125 ms      10972 ms
BatchWrite(sheet, n);      3 ms       4 ms         14 ms
For Excel I have only programmed VBA, so I cannot give you the exact C# syntax.
What I notice, though, is that you are doing something many people are tempted to do: writing to each cell in Excel separately.
Read/write operations against Excel are rather slow compared to operations performed in memory.
It would be better to pass an array of data to a function that writes all of that data to a defined range in one step. Before doing so, of course, you need to set the dimensions of the range correctly (equal to the size of the array).
Doing so should increase performance considerably.
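In C# interop terms, that advice boils down to something like this sketch, where excelworkSheet is the worksheet from the question and the 1000 x 10 size is just an example:
// Build everything in memory first, then hand Excel a single 2D array.
int rows = 1000, cols = 10;
var data = new object[rows, cols];
for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++)
        data[r, c] = r * cols + c;            // placeholder values
// The target range's dimensions must match the array exactly.
Excel.Range c1 = (Excel.Range)excelworkSheet.Cells[1, 1];
Excel.Range c2 = (Excel.Range)excelworkSheet.Cells[rows, cols];
excelworkSheet.get_Range(c1, c2).Value2 = data;   // one marshaled call instead of rows * cols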
