I would like to be able to efficiently retrieve a multi-dimensional array of formatted cell values from Excel. When I say formatted values, I mean I would like to get them exactly as they appear in Excel with all the cell NumberFormat applied.
The Range.Value and Range.Value2 properties work great for retrieving the values of a large number of cells into a multi-dimensional array. But those are the underlying cell values (at least Range.Value2 is; I'm not quite sure what Range.Value does to some of the values).
If I want to retrieve the actual text that is displayed in the cells, I can use the Range.Text property. This has some caveats. First, you need to AutoFit the cells or else you may get something like #### if not all the text is visible with the current cell width. Secondly, Range.Text does not work for more than one cell at a time so you would have to loop through all of the cells in the range and this can be extremely slow for large data sets.
The other method that I tried is to copy the range to the clipboard, parse the clipboard text as a tab-separated data stream, and transfer it into a multi-dimensional array. This seems to work great: although it is slower than getting Range.Value2, it is much faster for large datasets than getting Range.Text. However, I don't like the idea of using the system clipboard. If this were a really long operation that takes 60 seconds, the user might switch to another application while it is running and would be very unhappy to find that their clipboard either doesn't work or has mysterious data in it.
Is there a way that I can retrieve the formatted cell values to a multi-dimensional array efficiently?
I have added some sample code that is run from a couple of ribbon buttons in a VSTO app. The first button sets some good test values and number formats, and the second displays, in a MessageBox, what they look like when retrieved using each of these methods.
The sample output on my system is (it could be different on yours due to Regional Settings):
Output using Range.Value
1/25/2008 3:19:32 PM 5.12345
2008-01-25 15:19:32 0.456
Output using Range.Value2
39472.6385648148 5.12345
2008-01-25 15:19:32 0.456
Output using Clipboard Copy
1/25/2008 15:19 5.12
2008-01-25 15:19:32 45.60%
Output using Range.Text and Autofit
1/25/2008 15:19 5.12
2008-01-25 15:19:32 45.60%
The Range.Text and Clipboard methods produce the correct output, but as explained above they both have problems: Range.Text is slow and Clipboard is bad practice.
private void SetSampleValues()
{
var sheet = (Microsoft.Office.Interop.Excel.Worksheet) Globals.ThisAddIn.Application.ActiveSheet;
sheet.Cells.ClearContents();
sheet.Cells.ClearFormats();
var range = sheet.Range["A1"];
range.NumberFormat = "General";
range.Value2 = "2008-01-25 15:19:32";
range = sheet.Range["A2"];
range.NumberFormat = "#";
range.Value2 = "2008-01-25 15:19:32";
range = sheet.Range["B1"];
range.NumberFormat = "0.00";
range.Value2 = "5.12345";
range = sheet.Range["B2"];
range.NumberFormat = "0.00%";
range.Value2 = ".456";
}
private string ArrayToString(ref object[,] vals)
{
int dim1Start = vals.GetLowerBound(0); //Excel Interop will return index-1 based arrays instead of index-0 based
int dim1End = vals.GetUpperBound(0);
int dim2Start = vals.GetLowerBound(1);
int dim2End = vals.GetUpperBound(1);
var sb = new StringBuilder();
for (int i = dim1Start; i <= dim1End; i++)
{
for (int j = dim2Start; j <= dim2End; j++)
{
sb.Append(vals[i, j]);
if (j != dim2End)
sb.Append("\t");
}
sb.Append("\n");
}
return sb.ToString();
}
private void GetCellValues()
{
var sheet = (Microsoft.Office.Interop.Excel.Worksheet)Globals.ThisAddIn.Application.ActiveSheet;
var usedRange = sheet.UsedRange;
var sb = new StringBuilder();
sb.Append("Output using Range.Value\n");
var vals = (object [,]) usedRange.Value; //1-based array
sb.Append(ArrayToString(ref vals));
sb.Append("\nOutput using Range.Value2\n");
vals = (object[,])usedRange.Value2; //1-based array
sb.Append(ArrayToString(ref vals));
sb.Append("\nOutput using Clipboard Copy\n");
string previousClipboardText = Clipboard.GetText();
usedRange.Copy();
string clipboardText = Clipboard.GetText();
Clipboard.SetText(previousClipboardText);
vals = new object[usedRange.Rows.Count, usedRange.Columns.Count]; //0-based array
ParseClipboard(clipboardText,ref vals);
sb.Append(ArrayToString(ref vals));
sb.Append("\nOutput using Range.Text and Autofit\n");
//if you don't autofit, Range.Text may give you something like #####
usedRange.Columns.AutoFit();
usedRange.Rows.AutoFit();
vals = new object[usedRange.Rows.Count, usedRange.Columns.Count];
int startRow = usedRange.Row;
int endRow = usedRange.Row + usedRange.Rows.Count - 1;
int startCol = usedRange.Column;
int endCol = usedRange.Column + usedRange.Columns.Count - 1;
for (int r = startRow; r <= endRow; r++)
{
for (int c = startCol; c <= endCol; c++)
{
vals[r - startRow, c - startCol] = sheet.Cells[r, c].Text;
}
}
sb.Append(ArrayToString(ref vals));
MessageBox.Show(sb.ToString());
}
//requires reference to Microsoft.VisualBasic to get TextFieldParser
private void ParseClipboard(string text, ref object[,] vals)
{
using (var tabReader = new TextFieldParser(new StringReader(text)))
{
tabReader.SetDelimiters("\t");
tabReader.HasFieldsEnclosedInQuotes = true;
int row = 0;
while (!tabReader.EndOfData)
{
var fields = tabReader.ReadFields();
for (int i = 0; i < fields.Length; i++)
vals[row, i] = fields[i];
row++;
}
}
}
private void button1_Click(object sender, RibbonControlEventArgs e)
{
SetSampleValues();
}
private void button2_Click(object sender, RibbonControlEventArgs e)
{
GetCellValues();
}
I've found a partial solution. Apply the NumberFormat value to the parsed double of Value2. This only works one cell at a time, because reading NumberFormat from a range whose cells have different formats returns System.DBNull.
double.Parse(o.Value2.ToString()).ToString(o.NumberFormat.ToString())
The dates don't work with this though. If you know which columns contain certain things, like a formatted date, you can use DateTime.FromOADate on the double and then value.ToString(format) with the NumberFormat. The code below gets close but is not complete.
<snip>
sb.Append("\nOutput using Range.Value2\n");
vals = (object[,])usedRange.Value2; //1-based array
var format = GetFormat(usedRange);
sb.Append(ArrayToString(ref vals, format));
</snip>
private static object[,] GetFormat(Microsoft.Office.Interop.Excel.Range range)
{
var rows = range.Rows.Count;
var cols = range.Columns.Count;
object[,] vals = new object[rows, cols];
for (int r = 1; r <= rows; ++r)
{
for (int c = 1; c <= cols; ++c)
{
vals[r-1, c-1] = range[r, c].NumberFormat;
}
}
return vals;
}
private static string ArrayToString(ref object[,] vals, object[,] numberformat = null)
{
int dim1Start = vals.GetLowerBound(0); //Excel Interop will return index-1 based arrays instead of index-0 based
int dim1End = vals.GetUpperBound(0);
int dim2Start = vals.GetLowerBound(1);
int dim2End = vals.GetUpperBound(1);
var sb = new StringBuilder();
for (int i = dim1Start; i <= dim1End; i++)
{
for (int j = dim2Start; j <= dim2End; j++)
{
if (numberformat != null)
{
var format = numberformat[i-1, j-1].ToString();
double v;
if (double.TryParse(vals[i, j].ToString(), out v))
{
if (format.Contains("/") || format.Contains(":"))
{// parse a date
var date = DateTime.FromOADate(v);
sb.Append(date.ToString(format));
}
else
{
sb.Append(v.ToString(format));
}
}
else
{
sb.Append(vals[i, j].ToString());
}
}
else
{
sb.Append(vals[i, j]);
}
if (j != dim2End)
sb.Append("\t");
}
sb.Append("\n");
}
return sb.ToString();
}
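The two conversions this answer relies on can be checked without Excel. The sketch below is only an illustration: the `FormatSketch` class and its `Format` helper are hypothetical names, and the format strings passed in are .NET format strings chosen by hand. Excel number formats are not the same thing as .NET format strings (for example, lowercase `m` means minutes and `h` means a 12-hour clock in .NET), which is one reason the approach above is incomplete.

```csharp
using System;
using System.Globalization;

public static class FormatSketch
{
    // Hypothetical helper: treats the format as a date format when it contains
    // '/' or ':', mirroring the heuristic used above. The format must already be
    // a valid .NET format string; no Excel-to-.NET translation is attempted.
    public static string Format(double value, string netFormat)
    {
        bool looksLikeDate = netFormat.Contains("/") || netFormat.Contains(":");
        return looksLikeDate
            ? DateTime.FromOADate(value).ToString(netFormat, CultureInfo.InvariantCulture)
            : value.ToString(netFormat, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // 39472.6385648148 is the Value2 serial number from the sample output.
        Console.WriteLine(Format(39472.6385648148, "yyyy-MM-dd HH:mm:ss"));
        Console.WriteLine(Format(5.12345, "0.00"));
        Console.WriteLine(Format(0.456, "0.00%"));
    }
}
```

The invariant culture is used here only to make the output independent of Regional Settings; matching what Excel displays would require the user's culture instead.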
One solution to your problem is to use:
Range(XYZ).Value(11) = Range(ABC).Value(11)
As described here, this:
Returns the recordset representation of the specified Range object in an XML format.
Assuming that your Excel is configured in OpenXML format, this will copy the value/formula AND the formatting of range ABC and inject it into range XYZ.
Additionally, this answer explains the difference between Value and Value2.
.Value2 gives you the underlying value of the cell (could be empty, string, error, number (double) or boolean)
.Value gives you the same as .Value2 except if the cell was formatted as currency or date it gives you a VBA currency (which may truncate decimal places) or VBA date.
Related
I have a lot of Excel files that contain data along with empty rows and empty columns,
as shown below.
I am trying to remove the empty rows and columns from Excel using interop.
I created a simple WinForms application with the following code, and it works fine.
Dim lstFiles As New List(Of String)
lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))
Dim m_XlApp = New Excel.Application
Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
Dim m_xlWrkb As Excel.Workbook
For Each strFile As String In lstFiles
m_xlWrkb = m_xlWrkbs.Open(strFile)
Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)
Dim intRow As Integer = 1
While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
Else
intRow += 1
End If
End While
Dim intCol As Integer = 1
While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
Else
intCol += 1
End If
End While
Next
m_xlWrkb.Save()
m_xlWrkb.Close(SaveChanges:=True)
Marshal.ReleaseComObject(m_xlWrkb)
Marshal.ReleaseComObject(m_xlWrkbs)
m_XlApp.Quit()
Marshal.ReleaseComObject(m_XlApp)
But when cleaning big Excel files it takes a lot of time.
Any suggestions for optimizing this code, or another way to clean these Excel files faster? Is there a function that can delete empty rows in one click?
I don't mind if answers use C#.
EDIT:
I uploaded a sample file: Sample File. But not all files have the same structure.
I found that looping through an Excel worksheet can take some time if the worksheet is large, so my solution avoids looping over the worksheet itself. Instead, I build a 2-dimensional object array from the cells returned by UsedRange:
Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;
This is the array I loop through to get the indexes of the empty rows and columns. I make two int lists: one keeps the row indexes to delete, the other keeps the column indexes to delete.
List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
These lists are sorted from high to low so that rows are deleted from the bottom up and columns from right to left, which keeps the remaining indexes valid. Then simply loop through each list and delete the appropriate row or column.
DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);
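The reason for the high-to-low ordering can be seen without Excel. The sketch below is an assumption-laden illustration, not part of the solution: it uses a plain `List<string>` (0-based, unlike Excel's 1-based rows) in place of a worksheet, and `DeleteOrderSketch` and `RemoveAtDescending` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeleteOrderSketch
{
    // Removes the given 0-based positions from a copy of the list, highest index first.
    public static List<string> RemoveAtDescending(List<string> rows, IEnumerable<int> indexes)
    {
        var result = new List<string>(rows);
        foreach (int index in indexes.OrderByDescending(x => x))
        {
            // Removing a later item never shifts the items before it,
            // so the smaller indexes still point at the intended items.
            result.RemoveAt(index);
        }
        return result;
    }

    public static void Main()
    {
        var rows = new List<string> { "a", "", "b", "", "c" };
        List<string> cleaned = RemoveAtDescending(rows, new[] { 1, 3 });
        Console.WriteLine(string.Join(",", cleaned)); // a,b,c
    }
}
```

Deleting in ascending order instead would require adjusting every remaining index after each removal.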
Finally, after all the empty rows and columns have been deleted, I SaveAs the file to a new file name.
Hope this helps.
EDIT:
Addressed the UsedRange issue: if there are empty rows at the top of the worksheet, those rows will now be removed. This will also remove any empty columns to the left of the starting data, so the indexing works properly even if there are empty rows or columns before the data starts.
This was accomplished by taking the address of UsedRange, which has the form “$A$1:$D$4”. This would allow the use of an offset if the empty rows at the top and empty columns to the left were to remain. In this case I am simply deleting them. The number of rows to delete from the top can be calculated from the first cell address: for “$A$4”, the “4” is the row where the first data appears, so we need to delete the top 3 rows. The column part of the address has the form “A”, “AB” or even “AAD”, which required some translation; thanks to How to convert a column number (eg. 127) into an excel column (eg. AA), I was able to determine how many columns on the left need to be deleted.
class Program {
static void Main(string[] args) {
Excel.Application excel = new Excel.Application();
string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
Excel.Range usedRange = worksheet.UsedRange;
RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);
DeleteEmptyRowsCols(worksheet);
string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";
workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
excel.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
Console.ReadKey();
}
private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet) {
Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;
int totalRows = targetCells.Rows.Count;
int totalCols = targetCells.Columns.Count;
List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
// now we have a list of the empty rows and columns we need to delete
DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);
}
private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet) {
// the rows are sorted high to low - so index's wont shift
foreach (int rowIndex in rowsToDelete) {
worksheet.Rows[rowIndex].Delete();
}
}
private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet) {
// the cols are sorted high to low - so index's wont shift
foreach (int colIndex in colsToDelete) {
worksheet.Columns[colIndex].Delete();
}
}
private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols) {
List<int> emptyRows = new List<int>();
for (int i = 1; i <= totalRows; i++) { // 1-based array: include the last row
if (IsRowEmpty(allValues, i, totalCols)) {
emptyRows.Add(i);
}
}
// sort the list from high to low
return emptyRows.OrderByDescending(x => x).ToList();
}
private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols) {
List<int> emptyCols = new List<int>();
for (int i = 1; i <= totalCols; i++) { // 1-based array: include the last column
if (IsColumnEmpty(allValues, i, totalRows)) {
emptyCols.Add(i);
}
}
// sort the list from high to low
return emptyCols.OrderByDescending(x => x).ToList();
}
private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows) {
for (int i = 1; i <= totalRows; i++) { // 1-based array: check every row, including the last
if (allValues[i, colIndex] != null) {
return false;
}
}
return true;
}
private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols) {
for (int i = 1; i <= totalCols; i++) { // 1-based array: check every column, including the last
if (allValues[rowIndex, i] != null) {
return false;
}
}
return true;
}
private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange) {
string addressString = usedRange.Address.ToString();
int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
DeleteTopEmptyRows(worksheet, rowsToDelete);
int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
DeleteLeftEmptyColumns(worksheet, colsToDelete);
}
private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow) {
for (int i = 0; i < startRow - 1; i++) {
worksheet.Rows[1].Delete();
}
}
private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount) {
for (int i = 0; i < colCount - 1; i++) {
worksheet.Columns[1].Delete();
}
}
private static int GetNumberOfTopRowsToDelete(string address) {
string[] splitArray = address.Split(':');
string firstIndex = splitArray[0];
splitArray = firstIndex.Split('$');
string value = splitArray[2];
int returnValue = -1;
if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))
return returnValue;
return returnValue;
}
private static int GetNumberOfLeftColsToDelte(string address) {
string[] splitArray = address.Split(':');
string firstindex = splitArray[0];
splitArray = firstindex.Split('$');
string value = splitArray[1];
return ParseColHeaderToIndex(value);
}
private static int ParseColHeaderToIndex(string colAdress) {
int[] digits = new int[colAdress.Length];
for (int i = 0; i < colAdress.Length; ++i) {
digits[i] = Convert.ToInt32(colAdress[i]) - 64;
}
int mul = 1; int res = 0;
for (int pos = digits.Length - 1; pos >= 0; --pos) {
res += digits[pos] * mul;
mul *= 26;
}
return res;
}
}
EDIT 2: For testing I made a method that loops through the worksheet and compared it to my code that loops through an object array. It shows a significant difference.
Method to loop through the worksheet and delete empty rows and columns.
enum RowOrCol { Row, Column };
private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet) {
Excel.Range usedRange = worksheet.UsedRange;
int totalRows = usedRange.Rows.Count;
int totalCols = usedRange.Columns.Count;
RemoveEmpty(usedRange, RowOrCol.Row);
RemoveEmpty(usedRange, RowOrCol.Column);
}
private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol) {
int count;
Excel.Range curRange;
if (rowOrCol == RowOrCol.Column)
count = usedRange.Columns.Count;
else
count = usedRange.Rows.Count;
for (int i = count; i > 0; i--) {
bool isEmpty = true;
if (rowOrCol == RowOrCol.Column)
curRange = usedRange.Columns[i];
else
curRange = usedRange.Rows[i];
foreach (Excel.Range cell in curRange.Cells) {
if (cell.Value != null) {
isEmpty = false;
break; // we can exit this loop since the range is not empty
}
else {
// Cell value is null, continue checking
}
} // end loop thru each cell in this range (row or column)
if (isEmpty) {
curRange.Delete();
}
}
}
Then a Main for testing/timing the two methods.
enum RowOrCol { Row, Column };
static void Main(string[] args)
{
Excel.Application excel = new Excel.Application();
string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
Excel.Range usedRange = worksheet.UsedRange;
// Start test for looping thru each excel worksheet
Stopwatch sw = new Stopwatch();
Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
sw.Start();
ConventionalRemoveEmptyRowsCols(worksheet);
sw.Stop();
Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");
string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
Console.WriteLine("");
// Start test for looping thru object array
workbook = excel.Workbooks.Open(originalPath);
worksheet = workbook.Worksheets["Sheet1"];
usedRange = worksheet.UsedRange;
Console.WriteLine("Start stopwatch to loop thru object array...");
sw = new Stopwatch();
sw.Start();
DeleteEmptyRowsCols(worksheet);
sw.Stop();
// display results from second test
Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");
string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
excel.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
Console.WriteLine("");
Console.WriteLine("Finished testing methods - Press any key to exit");
Console.ReadKey();
}
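A side note on reading timings like the ones printed above: `TimeSpan.Milliseconds` is only the milliseconds component of a duration (0 to 999), whereas `TimeSpan.TotalMilliseconds` and `Stopwatch.ElapsedMilliseconds` report the whole duration. A minimal sketch of the difference, using a made-up 1.5 second duration rather than a real measurement:

```csharp
using System;

public static class ElapsedSketch
{
    public static void Main()
    {
        // Hypothetical duration standing in for a measured Stopwatch.Elapsed value.
        TimeSpan elapsed = TimeSpan.FromMilliseconds(1500);

        Console.WriteLine(elapsed.Milliseconds);      // 500: only the milliseconds component
        Console.WriteLine(elapsed.TotalMilliseconds); // 1500: the whole duration
    }
}
```

For runs shorter than one second the two values agree, so the distinction only matters for slower tests.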
EDIT 3 As per OP request...
I updated the code to match the OP's code, and found some interesting results. See below.
I changed the code to use the same functions you are using, i.e. EntireRow and CountA. I found that the code below performs terribly: in my tests it took 800+ milliseconds to execute. However, one subtle change made a huge difference.
On the line:
while (rowIndex <= worksheet.UsedRange.Rows.Count)
This is slowing things down a lot. Creating a range variable for UsedRange, instead of re-fetching it on each iteration of the while loop, makes a huge difference. So when I change the while loops to:
Excel.Range usedRange = worksheet.UsedRange;
int rowIndex = 1;
while (rowIndex <= usedRange.Rows.Count)
and
while (colIndex <= usedRange.Columns.Count)
This performed very close to my object array solution. I did not post the results, as you can use the code below and change the while loop to grab the UsedRange with each iteration or use the variable usedRange to test this.
private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet) {
//Excel.Range usedRange = worksheet.UsedRange; // <- using this variable makes the while loop much faster
int rowIndex = 1;
// delete empty rows
//while (rowIndex <= usedRange.Rows.Count) // <- changing this one line makes a huge difference - not grabbing the UsedRange with each iteration...
while (rowIndex <= worksheet.UsedRange.Rows.Count) {
if (worksheet.Application.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0) {
worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);
}
else {
rowIndex++;
}
}
// delete empty columns
int colIndex = 1;
// while (colIndex <= usedRange.Columns.Count) // <- change here also
while (colIndex <= worksheet.UsedRange.Columns.Count) {
if (worksheet.Application.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0) {
worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);
}
else {
colIndex++;
}
}
}
UPDATE by @Hadi
You can alter the DeleteCols and DeleteRows functions to get better performance if the Excel sheet contains extra blank rows and columns after the last used ones:
private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
// the rows are sorted high to low - so index's wont shift
List<int> NonEmptyRows = Enumerable.Range(1, rowsToDelete.Max()).ToList().Except(rowsToDelete).ToList();
if (NonEmptyRows.Max() < rowsToDelete.Max())
{
// there are empty rows after the last non empty row
Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[NonEmptyRows.Max() + 1,1];
Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[rowsToDelete.Max(), 1];
//Delete all empty rows after the last used row
worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);
} //else last non empty row = worksheet.Rows.Count
foreach (int rowIndex in rowsToDelete.Where(x => x < NonEmptyRows.Max()))
{
worksheet.Rows[rowIndex].Delete();
}
}
private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
// the cols are sorted high to low - so index's wont shift
//Get non Empty Cols
List<int> NonEmptyCols = Enumerable.Range(1, colsToDelete.Max()).ToList().Except(colsToDelete).ToList();
if (NonEmptyCols.Max() < colsToDelete.Max())
{
// there are empty columns after the last non empty column
Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1,NonEmptyCols.Max() + 1];
Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1,colsToDelete.Max()];
//Delete all empty columns after the last used column
worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);
} //else last non empty column = worksheet.Columns.Count
foreach (int colIndex in colsToDelete.Where(x => x < NonEmptyCols.Max()))
{
worksheet.Columns[colIndex].Delete();
}
}
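The same idea (one Delete call for a whole block of adjacent rows or columns) can be extended to every block, not just the trailing one. The sketch below is an extension of the update above, not part of it, and `RunSketch` and `ToRuns` are hypothetical names. It only groups the indexes into contiguous runs, ordered from the bottom up; no interop calls are made.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunSketch
{
    // Groups indexes into contiguous (Start, End) runs, highest run first,
    // so each run could be removed with a single range delete.
    public static List<(int Start, int End)> ToRuns(IEnumerable<int> indexes)
    {
        var runs = new List<(int Start, int End)>();
        foreach (int index in indexes.Distinct().OrderByDescending(x => x))
        {
            int last = runs.Count - 1;
            if (last >= 0 && runs[last].Start == index + 1)
                runs[last] = (index, runs[last].End); // extend the current run downwards
            else
                runs.Add((index, index));             // start a new run
        }
        return runs;
    }

    public static void Main()
    {
        foreach (var run in ToRuns(new[] { 1, 2, 4, 7, 8, 9 }))
            Console.WriteLine(run.Start + "-" + run.End); // 7-9, then 4-4, then 1-2
    }
}
```

Each run could then be passed to a range delete of the same form as the `worksheet.Range[cell1, cell2].EntireRow.Delete(...)` call above, which should reduce the number of COM calls when empty rows come in blocks.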
check my answer at Get Last non empty column and row index from excel using Interop
Maybe something to consider:
Sub usedRangeDeleteRowsCols()
Dim LastRow As Long, LastCol As Long, i As Long
LastRow = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByRows).Row
LastCol = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByColumns).Column
For i = LastRow To 1 Step -1
If WorksheetFunction.CountA(Range(Cells(i, 1), Cells(i, LastCol))) = 0 Then
Cells(i, 1).EntireRow.Delete
End If
Next
For i = LastCol To 1 Step -1
If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then
Cells(1, i).EntireColumn.Delete
End If
Next
End Sub
I think there are two efficiencies compared to equivalent functions in the original code. Firstly, instead of using Excel's unreliable UsedRange property, we find the last value and only scan rows and columns within the genuine used range.
Secondly the worksheet count function again only works within the genuine used range - for example when searching for blank rows we only look in the range of used columns (rather than .EntireRow).
The For loops work backwards because, for example, every time a row is deleted, the row address of the following data changes. Working backwards means the row addresses of the "data to be worked on" don't change.
In my opinion the most time consuming part could be enumerating and finding empty rows and columns.
What about:
http://www.howtogeek.com/206696/how-to-quickly-and-easily-delete-blank-rows-and-columns-in-excel-2013/
EDIT:
What about:
m_XlWrkSheet.Columns("A:A").SpecialCells(xlCellTypeBlanks).EntireRow.Delete
m_XlWrkSheet.Rows("1:1").SpecialCells(xlCellTypeBlanks).EntireColumn.Delete
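Tested on the sample data: the result looks OK and performance is better (tested from VBA, but the difference is huge). Note that this treats a row as empty when its cell in column A is blank, and a column as empty when its cell in row 1 is blank.
UPDATE:
Tested on a sample Excel file with 14k rows (made from the sample data): the original code takes ~30 s, this version <1 s.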
The easiest way that I know of is to hide non-blank cells and delete the visible ones:
var range = m_XlWrkSheet.UsedRange;
range.SpecialCells(XlCellType.xlCellTypeConstants).EntireRow.Hidden = true;
range.SpecialCells(XlCellType.xlCellTypeVisible).Delete(XlDeleteShiftDirection.xlShiftUp);
range.EntireRow.Hidden = false;
Faster methods are to not delete anything at all, but to move (cut+paste) the non-blank areas.
The fastest Interop way (there are faster, more complicated methods that don't open the file) is to get all values into an array, move the values within the array, and put the values back:
object[,] values = m_XlWrkSheet.UsedRange.Value2 as object[,];
// some code here (the values start from values[1, 1] not values[0, 0])
m_XlWrkSheet.UsedRange.Value2 = values;
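The "some code here" step could look roughly like the sketch below, which is one possible reading of this approach rather than the author's code. `CompactSketch` and `Compact` are hypothetical names, and a cell is assumed to be empty when its value is null. Writing the result back (clearing the old range and assigning the smaller array to a range of matching size) is left out because it needs a live worksheet.

```csharp
using System;
using System.Collections.Generic;

public static class CompactSketch
{
    // Returns a new 0-based array holding only the rows and columns
    // that contain at least one non-null value.
    public static object[,] Compact(object[,] values)
    {
        int r0 = values.GetLowerBound(0), r1 = values.GetUpperBound(0);
        int c0 = values.GetLowerBound(1), c1 = values.GetUpperBound(1);

        var keepRows = new List<int>();
        var keepCols = new List<int>();
        for (int r = r0; r <= r1; r++)
            for (int c = c0; c <= c1; c++)
                if (values[r, c] != null) { keepRows.Add(r); break; }
        for (int c = c0; c <= c1; c++)
            for (int r = r0; r <= r1; r++)
                if (values[r, c] != null) { keepCols.Add(c); break; }

        var result = new object[keepRows.Count, keepCols.Count];
        for (int i = 0; i < keepRows.Count; i++)
            for (int j = 0; j < keepCols.Count; j++)
                result[i, j] = values[keepRows[i], keepCols[j]];
        return result;
    }

    public static void Main()
    {
        // Build a 1-based 3x3 array like the one Excel interop returns.
        var values = (object[,])Array.CreateInstance(typeof(object), new[] { 3, 3 }, new[] { 1, 1 });
        values[1, 1] = "a"; values[1, 3] = "b";
        values[3, 1] = "c"; values[3, 3] = "d";

        object[,] compact = Compact(values);
        Console.WriteLine(compact.GetLength(0) + "x" + compact.GetLength(1)); // 2x2
    }
}
```

Because only values are copied, cell formatting and formulas would not move with the data, which is a trade-off of working on the Value2 array.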
You could open an ADO connection to the worksheet, get a list of fields, issue an SQL statement which includes only known fields, and also exclude records with no values in the known fields.
I have a lot of excel files that contains data and it contains empty rows and empty columns.
like shown bellow
I am trying to remove Empty rows and columns from excel using interop.
I create a simple winform application and used the following code and it works fine.
Dim lstFiles As New List(Of String)
lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))
Dim m_XlApp = New Excel.Application
Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
Dim m_xlWrkb As Excel.Workbook
For Each strFile As String In lstFiles
m_xlWrkb = m_xlWrkbs.Open(strFile)
Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)
Dim intRow As Integer = 1
While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
Else
intRow += 1
End If
End While
Dim intCol As Integer = 1
While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
Else
intCol += 1
End If
End While
Next
m_xlWrkb.Save()
m_xlWrkb.Close(SaveChanges:=True)
Marshal.ReleaseComObject(m_xlWrkb)
Marshal.ReleaseComObject(m_xlWrkbs)
m_XlApp.Quit()
Marshal.ReleaseComObject(m_XlApp)
But when cleaning big excel files it takes a lot of time.
Any suggestions for optimizing this code? or another way to clean this excel files faster? Is there a function that can delete empty rows in one click?
I don't have problem if answers are using C#
EDIT:
I uploaded a sample file Sample File. But not all files have same structure.
I found that looping through the excel worksheet can take some time if the worksheet is large. So my solution tried to avoid any looping in the worksheet. To avoid looping through the worksheet, I made a 2 dimensional object array from the cells returned from usedRange with:
Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;
This is the array I loop through to get the indexes of the empty rows and columns. I make 2 int lists, one keeps the row indexes to delete the other keeps the column indexes to delete.
List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
These lists will be sorted from high to low to simplify deleting rows from the bottom up and deleting columns from right to left. Then simply loop through each list and delete the appropriate row/col.
DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);
Finally after all the empty rows and columns have been deleted, I SaveAs the file to a new file name.
Hope this helps.
EDIT:
Addressed the UsedRange issue such that if there are empty rows at the top of the worksheet, those rows will now be removed. Also this will remove any empty columns to the left of the starting data. This allows for the indexing to work properly even if there are empty rows or columns before the data starts.
This was accomplished by taking the address of the first cell in UsedRange this will be an address of the form “$A$1:$D$4”. This will allow the use of an offset if the empty rows at the top and empty columns to the left are to remain and not be deleted. In this case I am simply deleting them. To get the number of rows to delete from the top can be calculated by the first “$A$4” address where the “4” is the row that the first data appears. So we need to delete the top 3 rows. The Column address is of the form “A”, “AB” or even “AAD” this required some translation and thanks to How to convert a column number (eg. 127) into an excel column (eg. AA) I was able to determine how many columns on the left need to be deleted.
class Program {
static void Main(string[] args) {
Excel.Application excel = new Excel.Application();
string originalPath = #"H:\ExcelTestFolder\Book1_Test.xls";
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
Excel.Range usedRange = worksheet.UsedRange;
RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);
DeleteEmptyRowsCols(worksheet);
string newPath = #"H:\ExcelTestFolder\Book1_Test_Removed.xls";
workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
excel.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
Console.ReadKey();
}
private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet) {
Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;
int totalRows = targetCells.Rows.Count;
int totalCols = targetCells.Columns.Count;
List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
// now we have a list of the empty rows and columns we need to delete
DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);
}
private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet) {
// the rows are sorted high to low - so index's wont shift
foreach (int rowIndex in rowsToDelete) {
worksheet.Rows[rowIndex].Delete();
}
}
private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet) {
// the cols are sorted high to low - so index's wont shift
foreach (int colIndex in colsToDelete) {
worksheet.Columns[colIndex].Delete();
}
}
private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols) {
List<int> emptyRows = new List<int>();
for (int i = 1; i <= totalRows; i++) {
if (IsRowEmpty(allValues, i, totalCols)) {
emptyRows.Add(i);
}
}
// sort the list from high to low
return emptyRows.OrderByDescending(x => x).ToList();
}
private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols) {
List<int> emptyCols = new List<int>();
for (int i = 1; i <= totalCols; i++) {
if (IsColumnEmpty(allValues, i, totalRows)) {
emptyCols.Add(i);
}
}
// sort the list from high to low
return emptyCols.OrderByDescending(x => x).ToList();
}
private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows) {
for (int i = 1; i <= totalRows; i++) {
if (allValues[i, colIndex] != null) {
return false;
}
}
return true;
}
private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols) {
for (int i = 1; i <= totalCols; i++) {
if (allValues[rowIndex, i] != null) {
return false;
}
}
return true;
}
private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange) {
string addressString = usedRange.Address.ToString();
int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
DeleteTopEmptyRows(worksheet, rowsToDelete);
int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
DeleteLeftEmptyColumns(worksheet, colsToDelete);
}
private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow) {
for (int i = 0; i < startRow - 1; i++) {
worksheet.Rows[1].Delete();
}
}
private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount) {
for (int i = 0; i < colCount - 1; i++) {
worksheet.Columns[1].Delete();
}
}
private static int GetNumberOfTopRowsToDelete(string address) {
string[] splitArray = address.Split(':');
string firstIndex = splitArray[0];
splitArray = firstIndex.Split('$');
string value = splitArray[2];
int returnValue = -1;
if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))
return returnValue;
return returnValue;
}
private static int GetNumberOfLeftColsToDelte(string address) {
string[] splitArray = address.Split(':');
string firstindex = splitArray[0];
splitArray = firstindex.Split('$');
string value = splitArray[1];
return ParseColHeaderToIndex(value);
}
private static int ParseColHeaderToIndex(string colAdress) {
int[] digits = new int[colAdress.Length];
for (int i = 0; i < colAdress.Length; ++i) {
digits[i] = Convert.ToInt32(colAdress[i]) - 64;
}
int mul = 1; int res = 0;
for (int pos = digits.Length - 1; pos >= 0; --pos) {
res += digits[pos] * mul;
mul *= 26;
}
return res;
}
}
EDIT 2: For testing, I made a method that loops through the worksheet and compared it to my code that loops through an object array. It shows a significant difference.
Method to Loop thru the worksheet and delete empty rows and columns.
enum RowOrCol { Row, Column };
private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet) {
Excel.Range usedRange = worksheet.UsedRange;
int totalRows = usedRange.Rows.Count;
int totalCols = usedRange.Columns.Count;
RemoveEmpty(usedRange, RowOrCol.Row);
RemoveEmpty(usedRange, RowOrCol.Column);
}
private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol) {
int count;
Excel.Range curRange;
if (rowOrCol == RowOrCol.Column)
count = usedRange.Columns.Count;
else
count = usedRange.Rows.Count;
for (int i = count; i > 0; i--) {
bool isEmpty = true;
if (rowOrCol == RowOrCol.Column)
curRange = usedRange.Columns[i];
else
curRange = usedRange.Rows[i];
foreach (Excel.Range cell in curRange.Cells) {
if (cell.Value != null) {
isEmpty = false;
break; // we can exit this loop since the range is not empty
}
else {
// Cell value is null, continue checking
}
} // end loop thru each cell in this range (row or column)
if (isEmpty) {
curRange.Delete();
}
}
}
Then a Main for testing/timing the two methods.
enum RowOrCol { Row, Column };
static void Main(string[] args)
{
Excel.Application excel = new Excel.Application();
string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
Excel.Range usedRange = worksheet.UsedRange;
// Start test for looping thru each excel worksheet
Stopwatch sw = new Stopwatch();
Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
sw.Start();
ConventionalRemoveEmptyRowsCols(worksheet);
sw.Stop();
Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");
string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
Console.WriteLine("");
// Start test for looping thru object array
workbook = excel.Workbooks.Open(originalPath);
worksheet = workbook.Worksheets["Sheet1"];
usedRange = worksheet.UsedRange;
Console.WriteLine("Start stopwatch to loop thru object array...");
sw = new Stopwatch();
sw.Start();
DeleteEmptyRowsCols(worksheet);
sw.Stop();
// display results from second test
Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");
string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
workbook.Close();
excel.Quit();
System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
Console.WriteLine("");
Console.WriteLine("Finished testing methods - Press any key to exit");
Console.ReadKey();
}
EDIT 3 As per OP request...
I updated the code to match the OP's code, and found some interesting results; see below.
I changed the code to use the same functions you are using, i.e. EntireRow and CountA. I found that the code below performs terribly: in my tests its execution time was 800+ milliseconds. However, one subtle change made a huge difference.
On the line:
while (rowIndex <= worksheet.UsedRange.Rows.Count)
This is slowing things down a lot. If you create a range variable for UsedRange and do not keep re-grabbing it with each iteration of the while loop, it makes a huge difference. So, when I change the while loops to…
Excel.Range usedRange = worksheet.UsedRange;
int rowIndex = 1;
while (rowIndex <= usedRange.Rows.Count)
and
while (colIndex <= usedRange.Columns.Count)
This performed very close to my object array solution. I did not post the results, as you can use the code below and change the while loop to grab the UsedRange with each iteration or use the variable usedRange to test this.
private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet) {
//Excel.Range usedRange = worksheet.UsedRange; // <- using this variable makes the while loop much faster
int rowIndex = 1;
// delete empty rows
//while (rowIndex <= usedRange.Rows.Count) // <- changing this one line makes a huge difference - not grabbing the UsedRange with each iteration...
while (rowIndex <= worksheet.UsedRange.Rows.Count) {
if (excel.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0) {
worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);
}
else {
rowIndex++;
}
}
// delete empty columns
int colIndex = 1;
// while (colIndex <= usedRange.Columns.Count) // <- change here also
while (colIndex <= worksheet.UsedRange.Columns.Count) {
if (excel.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0) {
worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);
}
else {
colIndex++;
}
}
}
UPDATE by @Hadi
You can alter the DeleteCols and DeleteRows functions to get better performance if the Excel sheet contains extra blank rows and columns after the last used ones:
private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
// the rows are sorted high to low - so index's wont shift
List<int> NonEmptyRows = Enumerable.Range(1, rowsToDelete.Max()).ToList().Except(rowsToDelete).ToList();
if (NonEmptyRows.Max() < rowsToDelete.Max())
{
// there are empty rows after the last non empty row
Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[NonEmptyRows.Max() + 1,1];
Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[rowsToDelete.Max(), 1];
//Delete all empty rows after the last used row
worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);
} //else last non empty row = worksheet.Rows.Count
foreach (int rowIndex in rowsToDelete.Where(x => x < NonEmptyRows.Max()))
{
worksheet.Rows[rowIndex].Delete();
}
}
private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
// the cols are sorted high to low - so index's wont shift
//Get non Empty Cols
List<int> NonEmptyCols = Enumerable.Range(1, colsToDelete.Max()).ToList().Except(colsToDelete).ToList();
if (NonEmptyCols.Max() < colsToDelete.Max())
{
// there are empty columns after the last non-empty column
Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1,NonEmptyCols.Max() + 1];
Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1,colsToDelete.Max()];
//Delete all empty columns after the last used column
worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);
} //else last non empty column = worksheet.Columns.Count
foreach (int colIndex in colsToDelete.Where(x => x < NonEmptyCols.Max()))
{
worksheet.Columns[colIndex].Delete();
}
}
check my answer at Get Last non empty column and row index from excel using Interop
Maybe something to consider:
Sub usedRangeDeleteRowsCols()
Dim LastRow As Long, LastCol As Long, i As Long
LastRow = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByRows).Row
LastCol = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByColumns).Column
For i = LastRow To 1 Step -1
If WorksheetFunction.CountA(Range(Cells(i, 1), Cells(i, LastCol))) = 0 Then
Cells(i, 1).EntireRow.Delete
End If
Next
For i = LastCol To 1 Step -1
If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then
Cells(1, i).EntireColumn.Delete
End If
Next
End Sub
I think there are two efficiencies compared to equivalent functions in the original code. Firstly, instead of using Excel's unreliable UsedRange property, we find the last value and only scan rows and columns within the genuine used range.
Secondly the worksheet count function again only works within the genuine used range - for example when searching for blank rows we only look in the range of used columns (rather than .EntireRow).
The For loops work backwards because, for example, every time a row is deleted, the row address of following data changes. Working backwards means the row addresses of "data to be worked on" doesn't change.
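The reason for iterating backwards can be seen without Excel at all. The following is a minimal C# sketch, not part of the answer above, that uses a plain list as a stand-in for worksheet rows; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class BackwardDeleteDemo
{
    // Removes blank entries in place, walking from the last index to the first
    // so that a removal never shifts an index that is still to be visited.
    public static void RemoveBlankRows(List<string> rows)
    {
        for (int i = rows.Count - 1; i >= 0; i--)
        {
            if (string.IsNullOrEmpty(rows[i]))
            {
                rows.RemoveAt(i);
            }
        }
    }
}
```

A forward loop over the same list would skip the second of two adjacent blank entries, because removing index i moves the next entry into index i before the counter advances.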
In my opinion the most time consuming part could be enumerating and finding empty rows and columns.
What about:
http://www.howtogeek.com/206696/how-to-quickly-and-easily-delete-blank-rows-and-columns-in-excel-2013/
EDIT:
What about:
m_XlWrkSheet.Columns("A:A").SpecialCells(xlCellTypeBlanks).EntireRow.Delete
m_XlWrkSheet.Rows("1:1").SpecialCells(xlCellTypeBlanks).EntireColumn.Delete
Tested on sample data result looks ok, performance better (tested from VBA but difference is huge).
UPDATE:
Tested on sample Excel with 14k rows (made from sample data) original code ~30 s, this version <1s
The easiest way that I know of is to hide non-blank cells and delete the visible ones:
var range = m_XlWrkSheet.UsedRange;
range.SpecialCells(XlCellType.xlCellTypeConstants).EntireRow.Hidden = true;
range.SpecialCells(XlCellType.xlCellTypeVisible).Delete(XlDeleteShiftDirection.xlShiftUp);
range.EntireRow.Hidden = false;
Faster methods are to not delete anything at all, but to move (cut+paste) the non-blank areas.
The fastest Interop way (there are faster more complicated methods without opening the file) is to get all values in array, move the values in the array, and put the values back:
object[,] values = m_XlWrkSheet.UsedRange.Value2 as object[,];
// some code here (the values start from values[1, 1] not values[0, 0])
m_XlWrkSheet.UsedRange.Value2 = values;
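One possible shape for the "some code here" step is sketched below. It is an illustration only, under two assumptions: empty cells arrive as null in the array, and the result should keep the same bounds as the source so that it can be assigned back to the same range. The names ArrayCompactor and Compact are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ArrayCompactor
{
    // Returns a new array with the same bounds as 'values' in which rows and
    // columns that contain only nulls are dropped and the remaining data is
    // shifted up and to the left; the freed cells at the bottom/right stay null.
    public static object[,] Compact(object[,] values)
    {
        int rowLo = values.GetLowerBound(0), rowHi = values.GetUpperBound(0);
        int colLo = values.GetLowerBound(1), colHi = values.GetUpperBound(1);

        // Collect the indexes of rows that hold at least one value
        var keepRows = new List<int>();
        for (int r = rowLo; r <= rowHi; r++)
        {
            for (int c = colLo; c <= colHi; c++)
            {
                if (values[r, c] != null) { keepRows.Add(r); break; }
            }
        }

        // Collect the indexes of columns that hold at least one value
        var keepCols = new List<int>();
        for (int c = colLo; c <= colHi; c++)
        {
            for (int r = rowLo; r <= rowHi; r++)
            {
                if (values[r, c] != null) { keepCols.Add(c); break; }
            }
        }

        // Array.CreateInstance lets the result keep the source's lower bounds
        var result = (object[,])Array.CreateInstance(
            typeof(object),
            new[] { rowHi - rowLo + 1, colHi - colLo + 1 },
            new[] { rowLo, colLo });

        for (int i = 0; i < keepRows.Count; i++)
        {
            for (int j = 0; j < keepCols.Count; j++)
            {
                result[rowLo + i, colLo + j] = values[keepRows[i], keepCols[j]];
            }
        }
        return result;
    }
}
```

Note that writing the array back moves values only; cell formatting stays where it was, and any formulas in the range would be replaced by their values.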
You could open an ADO connection to the worksheet, get a list of fields, issue an SQL statement which includes only known fields, and also exclude records with no values in the known fields.
I would like to format all added values in my Excel file, and I have a "small" and "fast" solution like this:
Item2 is a List<string>, Item3 is a List<List<string>>
if (chkWithValues.Checked && results.Item3.Any())
{
var rows = results.Item3.Count;
var cols = results.Item3.Max(x => x.Count);
object[,] values = new object[rows, cols];
object[,] format = new object[rows, cols];
//All returned items are inserted into the Excel file
//Item2 contains the database types, Item3 the Values
// pgMain shows the progress for the selected tables
for (int j = 0; j < results.Item3.Count(); j++)
{
int tmpNbr = 1;
foreach (string value in results.Item3[j])
{
values[j, tmpNbr - 1] = Converter.Convert(results.Item2[tmpNbr - 1], value).ToString().Replace("'", "");
format[j, tmpNbr - 1] = ExcelColumnTypes.ConvertToExcelTypes(results.Item2[tmpNbr - 1]);
tmpNbr++;
}
pgMain.Maximum = results.Item3.Count();
pgMain.PerformStep();
}
Excel.Range range = xlWorksheet.Range["A3", GetExcelColumnName(cols) + (rows + 2)];
range.Value = values;
range.NumberFormat = format;
}
To set the number format efficiently with a single assignment, I use a 2D array that contains the number format for every cell.
The problem is that I get the error message "Unable to set the NumberFormat property of the Range class" when I have more than (I think) 50,000 cells to format.
Does anyone know a solution that is fast and can handle a large number of cells without error?
update:
ExcelColumnTypes.ConvertToExcelTypes
public static string ConvertToExcelTypes(string databaseType)
{
if (DatabaseColumnTypes.DOUBLE.Contains(databaseType))
return DOUBLEPO1;
if (DatabaseColumnTypes.DATE.Contains(databaseType))
return DATE2;
if (DatabaseColumnTypes.INTEGER.Contains(databaseType))
return INT;
return TEXT;
}
The DatabaseColumnTypes members are either lists of constants or single constants.
Sample:
public const string VARBINARY = "varbinary";
public static List<string> STRING = new List<string>()
{
CHAR,
VARCHAR,
TEXT,
NTEXT,
NCHAR,
NVARCHAR,
BINARY,
VARBINARY
};
Please change
range.NumberFormat = format;
to
range.NumberFormatLocal = format;
Then it should work.
I have the following routine that dumps a DataTable into an Excel worksheet.
private void RenderDataTableOnXlSheet(DataTable dt, Excel.Worksheet xlWk,
string [] columnNames, string [] fieldNames)
{
// render the column names (e.g. headers)
for (int i = 0; i < columnNames.Length; i++)
xlWk.Cells[1, i + 1] = columnNames[i];
// render the data
for (int i = 0; i < fieldNames.Length; i++)
{
for (int j = 0; j < dt.Rows.Count; j++)
{
xlWk.Cells[j + 2, i + 1] = dt.Rows[j][fieldNames[i]].ToString();
}
}
}
For whatever reason, dumping a DataTable of 25 columns and 400 rows takes about 10-15 seconds on my relatively modern PC. It takes even longer on testers' machines.
Is there anything I can do to speed up this code? Or is interop just inherently slow?
SOLUTION: Based on suggestions from Helen Toomik, I've modified the method and it should now work for several common data types (int32, double, datetime, string). Feel free to extend it. The speed for processing my dataset went from 15 seconds to under 1.
private void RenderDataTableOnXlSheet(DataTable dt, Excel.Worksheet xlWk, string [] columnNames, string [] fieldNames)
{
Excel.Range rngExcel = null;
Excel.Range headerRange = null;
try
{
// render the column names (e.g. headers)
for (int i = 0; i < columnNames.Length; i++)
xlWk.Cells[1, i + 1] = columnNames[i];
// for each column, create an array and set the array
// to the excel range for that column.
for (int i = 0; i < fieldNames.Length; i++)
{
string[,] clnDataString = new string[dt.Rows.Count, 1];
int[,] clnDataInt = new int[dt.Rows.Count, 1];
double[,] clnDataDouble = new double[dt.Rows.Count, 1];
string columnLetter = char.ConvertFromUtf32("A".ToCharArray()[0] + i);
rngExcel = xlWk.get_Range(columnLetter + "2", Missing.Value);
rngExcel = rngExcel.get_Resize(dt.Rows.Count, 1);
string dataTypeName = dt.Columns[fieldNames[i]].DataType.Name;
for (int j = 0; j < dt.Rows.Count; j++)
{
if (fieldNames[i].Length > 0)
{
switch (dataTypeName)
{
case "Int32":
clnDataInt[j, 0] = Convert.ToInt32(dt.Rows[j][fieldNames[i]]);
break;
case "Double":
clnDataDouble[j, 0] = Convert.ToDouble(dt.Rows[j][fieldNames[i]]);
break;
case "DateTime":
if (fieldNames[i].ToLower().Contains("time"))
clnDataString[j, 0] = Convert.ToDateTime(dt.Rows[j][fieldNames[i]]).ToShortTimeString();
else if (fieldNames[i].ToLower().Contains("date"))
clnDataString[j, 0] = Convert.ToDateTime(dt.Rows[j][fieldNames[i]]).ToShortDateString();
else
clnDataString[j, 0] = Convert.ToDateTime(dt.Rows[j][fieldNames[i]]).ToString();
break;
default:
clnDataString[j, 0] = dt.Rows[j][fieldNames[i]].ToString();
break;
}
}
else
clnDataString[j, 0] = string.Empty;
}
// set values in the sheet wholesale.
if (dataTypeName == "Int32")
rngExcel.set_Value(Missing.Value, clnDataInt);
else if (dataTypeName == "Double")
rngExcel.set_Value(Missing.Value, clnDataDouble);
else
rngExcel.set_Value(Missing.Value, clnDataString);
}
// figure out the letter of the last column (supports 1 letter column names)
string lastColumn = char.ConvertFromUtf32("A".ToCharArray()[0] + columnNames.Length - 1);
// make the header range bold
headerRange = xlWk.get_Range("A1", lastColumn + "1");
headerRange.Font.Bold = true;
// autofit for better view
xlWk.Columns.AutoFit();
}
finally
{
ReleaseObject(headerRange);
ReleaseObject(rngExcel);
}
}
private void ReleaseObject(object obj)
{
try
{
System.Runtime.InteropServices.Marshal.ReleaseComObject(obj);
obj = null;
}
catch
{
obj = null;
}
finally
{
GC.Collect();
}
}
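The method above notes in a comment that it only supports one-letter column names. As a sketch of how that limit could be lifted, the helper below converts a 1-based column number into its letter name; the names ColumnNames and FromNumber are hypothetical and not part of the original code.

```csharp
using System;

public static class ColumnNames
{
    // Converts a 1-based column number to its letter name: 1 -> "A", 27 -> "AA".
    public static string FromNumber(int columnNumber)
    {
        if (columnNumber < 1)
            throw new ArgumentOutOfRangeException("columnNumber");

        string name = string.Empty;
        while (columnNumber > 0)
        {
            // Column names are bijective base 26 (there is no zero digit),
            // hence the "- 1" before taking the remainder and dividing.
            int remainder = (columnNumber - 1) % 26;
            name = (char)('A' + remainder) + name;
            columnNumber = (columnNumber - 1) / 26;
        }
        return name;
    }
}
```

With such a helper, the column letter for field i (zero-based) would be ColumnNames.FromNumber(i + 1) instead of the char arithmetic used in the method.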
Instead of setting cell values one by one, do it in a batch.
Step 1. Transfer the data from your DataTable into an array with the same dimensions.
Step 2. Define an Excel Range object that spans the appropriate range.
Step 3. Set the Range.Value to the array.
This will be a lot faster because you will have a total of two calls across the Interop boundary (one to get the Range object, one to set its value), instead of two per cell (get cell, set value).
There is some sample code at MSDN KB article 302096.
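Step 1 can be sketched without touching Excel. The helper below is an illustration under assumptions: the names DataTableBatch and ToArray are made up, and DBNull is mapped to null on the assumption that a null entry should leave the cell empty.

```csharp
using System;
using System.Data;

public static class DataTableBatch
{
    // Copies the named fields of every row into a zero-based object[,] that
    // could be assigned to a Range of the same size in a single call.
    public static object[,] ToArray(DataTable dt, string[] fieldNames)
    {
        int rows = dt.Rows.Count;
        int cols = fieldNames.Length;
        var data = new object[rows, cols];

        for (int r = 0; r < rows; r++)
        {
            DataRow row = dt.Rows[r];
            for (int c = 0; c < cols; c++)
            {
                object value = row[fieldNames[c]];
                // Assumption: a database null should become an empty cell
                data[r, c] = value == DBNull.Value ? null : value;
            }
        }
        return data;
    }
}
```

Steps 2 and 3 would then be a single range lookup and a single assignment, for example getting the cell "A2", resizing it to the array's dimensions with get_Resize as the question's updated code does, and setting its value to the array.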
Interop is inherently very slow.
There is a large overhead associated with each call.
To speed it up try writing back an object array of data to a range of cells in one assignment statement.
Or if this is a serious problem try using one of the Managed Code Excel extensions that can read/write data using managed code via the XLL interface. (Addin Express, Managed XLL etc.)
If you have a recordset, the fastest way to write to Excel is CopyFromRecordset.
Do you have a specific requirement to go the COM automation route? If not, you have a few other options.
Use the OLEDB provider to create/write to an Excel file http://support.microsoft.com/kb/316934
Use a third party library to write to Excel. Depending on your licensing requirements there are a few options.
Update: A good free library is NPOI http://npoi.codeplex.com/
Write the data to a csv file, and load that into Excel
Write the data as XML which can be loaded into Excel.
Use the Open XML SDK
http://www.microsoft.com/downloads/details.aspx?familyid=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en
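For the CSV option in the list above, a minimal sketch might look like this. The names CsvExport, Escape and ToCsv are hypothetical, values are formatted with the invariant culture, and a comma is assumed as the delimiter; whether Excel splits on commas when opening the file depends on the machine's regional list separator.

```csharp
using System;
using System.Data;
using System.Globalization;
using System.Linq;
using System.Text;

public static class CsvExport
{
    // Quotes a field when it contains a comma, quote or line break,
    // doubling embedded quotes (the convention described in RFC 4180).
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds the CSV text: one header line, then one line per data row.
    public static string ToCsv(DataTable dt)
    {
        var sb = new StringBuilder();
        sb.Append(string.Join(",",
            dt.Columns.Cast<DataColumn>().Select(c => Escape(c.ColumnName))));
        sb.Append("\r\n");

        foreach (DataRow row in dt.Rows)
        {
            // Convert.ToString yields an empty string for null and DBNull
            sb.Append(string.Join(",",
                row.ItemArray.Select(v =>
                    Escape(Convert.ToString(v, CultureInfo.InvariantCulture)))));
            sb.Append("\r\n");
        }
        return sb.ToString();
    }
}
```

The resulting string can be saved with File.WriteAllText and the file opened in Excel, which avoids per-cell Interop calls entirely.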
The fastest method Interop offers is CopyFromRecordset, but the ADODB library has to be used.
Definitely the fastest way/method and I have tried a few. Perhaps, not easy to use but the speed is astonishing:
https://learn.microsoft.com/en-us/office/vba/api/excel.range.copyfromrecordset
a short sample:
using ADODB;
using Microsoft.Office.Interop;
//--- datatable --- already exists
DataTable dt_data = new DataTable();
//--- or your dt code is here ..........
//--- mine has 3 columns ------
//--- code to populate ADO rs with DataTable data --- nothing special
//--- create empty rs .....
ADODB.Recordset rs = new ADODB.Recordset();
rs.CursorType = CursorTypeEnum.adOpenKeyset;
rs.CursorLocation = CursorLocationEnum.adUseClient;
rs.LockType = LockTypeEnum.adLockOptimistic;
rs.Fields.Append("employee_id",DataTypeEnum.adBSTR,255,FieldAttributeEnum.adFldIsNullable);
rs.Fields.Append("full_name", DataTypeEnum.adBSTR, 255, FieldAttributeEnum.adFldIsNullable);
rs.Fields.Append("start_date", DataTypeEnum.adBSTR, 10, FieldAttributeEnum.adFldIsNullable);
rs.Open();
//--- populate ADO rs with DataTable data ----
for (int i = 0; i < dt_data.Rows.Count; i++)
{
rs.AddNew();
rs.Fields["employee_id"].Value = dt_data.Rows[i]["employee_id"].ToString();
rs.Fields["full_name"].Value = dt_data.Rows[i]["full_name"].ToString();
//--- if date is empty......
if (dt_data.Rows[i]["start_date"].ToString().Length > 0)
{
rs.Fields["start_date"].Value = dt_data.Rows[i]["start_date"].ToString();
}
rs.Update();
}
Microsoft.Office.Interop.Excel.Application xlexcel;
Microsoft.Office.Interop.Excel.Workbook xlWorkBook;
Microsoft.Office.Interop.Excel.Worksheet xlWorkSheet;
object misValue = System.Reflection.Missing.Value;
xlexcel = new Microsoft.Office.Interop.Excel.Application();
xlexcel.Visible = true;
xlWorkBook = xlexcel.Workbooks.Add(misValue);
xlWorkSheet = (Microsoft.Office.Interop.Excel.Worksheet)xlWorkBook.Worksheets.get_Item(1);
//--- populate columns from rs --
for (int i = 0; i < rs.Fields.Count; i++)
{
xlWorkSheet.Cells[1, i + 1] = rs.Fields[i].Name.ToString();
};
//----- .CopyFromRecordset method -- (rs object, MaxRows, MaxColumns) --- in this case 3 columns, but it can be 1, 2, 3, etc. ------
//----- CloneFilteredRecordset is a helper that is not shown in this post ------
xlWorkSheet.Cells[2, 1].CopyFromRecordset(CloneFilteredRecordset(rs), rs.RecordCount, 3);
You could create an Excel add-in, with VBA code to do all your db heavy lifting. From .NET, all you'd need to do is instantiate Excel, add the add-in, and call the Excel VBA routine, passing any parameters to it that it needs to execute your SQL statements.
I agree with Charles. Interop is really slow. But try this:
private void RenderDataTableOnXlSheet(DataTable dt, Excel.Worksheet xlWk,
string [] columnNames, string [] fieldNames)
{
// render the column names (e.g. headers)
int columnLength = columnNames.Length;
for (int i = 0; i < columnLength; i++)
xlWk.Cells[1, i + 1] = columnNames[i];
// render the data
int fieldLength = fieldNames.Length;
int rowCount = dt.Rows.Count;
for (int j = 0; j < rowCount; j++)
{
for (int i = 0; i < fieldLength; i++)
{
xlWk.Cells[j + 2, i + 1] = dt.Rows[j][fieldNames[i]].ToString();
}
}
}
HTH
I have an array in C# that is 1-based (generated from a call to get_Value for an Excel Range).
I get a 2D array, for example:
object[,] ExcelData = (object[,]) MySheet.UsedRange.get_Value(Excel.XlRangeValueDataType.xlRangeValueDefault);
This appears as an array with bounds such as ExcelData[1..20,1..5].
Is there any way to tell the compiler to rebase this so that I do not need to add 1 to loop counters the whole time?
List<string> RowHeadings = new List<string>();
string [,] Results = new string[MaxRows, 1];
for (int Row = 0; Row < MaxRows; Row++) {
if (ExcelData[Row+1, 1] != null)
RowHeadings.Add(ExcelData[Row+1, 1]);
...
...
Results[Row, 0] = ExcelData[Row+1, 1];
// & other stuff in here that requires a 0-based Row
}
It makes things less readable, since an array created for writing will be zero-based.
Why not just change your index?
List<string> RowHeadings = new List<string>();
for (int Row = 1; Row <= MaxRows; Row++) {
if (ExcelData[Row, 1] != null)
RowHeadings.Add(ExcelData[Row, 1]);
}
Edit: Here is an extension method that would create a new, zero-based array from your original one (basically it just creates a new array that is one element smaller and copies to that new array all elements but the first element that you are currently skipping anyhow):
public static T[] ToZeroBasedArray<T>(this T[] array)
{
int len = array.Length - 1;
T[] newArray = new T[len];
Array.Copy(array, 1, newArray, 0, len);
return newArray;
}
That being said you need to consider if the penalty (however slight) of creating a new array is worth improving the readability of the code. I am not making a judgment (it very well may be worth it) I am just making sure you don't run with this code if it will hurt the performance of your application.
Create a wrapper for the ExcelData array with a this[,] indexer and do rebasing logic there. Something like:
class ExcelDataWrapper
{
private object[,] _excelData;
public ExcelDataWrapper(object[,] excelData)
{
_excelData = excelData;
}
public object this[int x, int y]
{
get { return _excelData[x+1, y+1]; }
}
}
Since you need Row to remain as-is (based on your comments), you could just introduce another loop variable:
List<string> RowHeadings = new List<string>();
string [,] Results = new string[MaxRows, 1];
for (int Row = 0, SrcRow = 1; SrcRow <= MaxRows; Row++, SrcRow++) {
if (ExcelData[SrcRow, 1] != null)
RowHeadings.Add(ExcelData[SrcRow, 1]);
...
...
Results[Row, 0] = ExcelData[SrcRow, 1];
}
Why not use:
for (int Row = 1; Row <= MaxRows; Row++) {
Or is there something I'm missing?
EDIT: as it turns out that something is missing, I would use another counter (starting at 0) for that purpose, and use a 1 based Row index for the array. It's not good practice to use the index for another use than the index in the target array.
Is changing the loop counter too hard for you?
for (int Row = 1; Row <= MaxRows; Row++)
If the counter's range is right, you don't have to add 1 to anything inside the loop so you don't lose readability. Keep it simple.
I agree that working with base-1 arrays from .NET can be a hassle. It is also potentially error-prone, as you have to mentally make a shift each time you use it, as well as correctly remember which situations will be base 1 and which will be base 0.
The most performant approach is to simply make these mental shifts and index appropriately, using base-1 or base-0 as required.
I personally prefer to convert the two dimensional base-1 arrays to two dimensional base-0 arrays. This, unfortunately, requires the performance hit of copying over the array to a new array, as there is no way to re-base an array in place.
Here's an extension method that can do this for the 2D arrays returned by Excel:
public static TResult[,] CloneBase0<TSource, TResult>(
this TSource[,] sourceArray)
{
if (sourceArray == null)
{
throw new ArgumentNullException(
"sourceArray", "The 'sourceArray' is null, which is invalid.");
}
int numRows = sourceArray.GetLength(0);
int numColumns = sourceArray.GetLength(1);
TResult[,] resultArray = new TResult[numRows, numColumns];
int lb1 = sourceArray.GetLowerBound(0);
int lb2 = sourceArray.GetLowerBound(1);
for (int r = 0; r < numRows; r++)
{
for (int c = 0; c < numColumns; c++)
{
// cast via object, since TSource does not convert to TResult implicitly
resultArray[r, c] = (TResult)(object)sourceArray[lb1 + r, lb2 + c];
}
}
return resultArray;
}
And then you can use it like this:
object[,] array2DBase1 = (object[,]) MySheet.UsedRange.get_Value(Type.Missing);
object[,] array2DBase0 = array2DBase1.CloneBase0<object, object>();
for (int row = 0; row < array2DBase0.GetLength(0); row++)
{
for (int column = 0; column < array2DBase0.GetLength(1); column++)
{
// Your code goes here...
}
}
For massively sized arrays, you might not want to do this, but I find that, in general, it really cleans up your code (and mind-set) to make this conversion, and then always work in base-0.
Hope this helps...
Mike
For 1 based arrays and Excel range operations as well as UDF (SharePoint) functions I use this utility function
public static object[,] ToObjectArray(this Object Range)
{
Type type = Range.GetType();
if (type.IsArray && type.Name == "Object[,]")
{
var sourceArray = Range as Object[,];
int lb1 = sourceArray.GetLowerBound(0);
int lb2 = sourceArray.GetLowerBound(1);
if (lb1 == 0 && lb2 == 0)
{
return sourceArray;
}
else
{
int numRows = sourceArray.GetLength(0);
int numColumns = sourceArray.GetLength(1);
var resultArray = new Object[numRows, numColumns];
for (int r = 0; r < numRows; r++)
{
for (int c = 0; c < numColumns; c++)
{
resultArray[r, c] = sourceArray[lb1 + r, lb2 + c];
}
}
return resultArray;
}
}
else if (type.IsCOMObject)
{
// Get the Value2 property from the object.
Object value = type.InvokeMember("Value2",
System.Reflection.BindingFlags.Instance |
System.Reflection.BindingFlags.Public |
System.Reflection.BindingFlags.GetProperty,
null,
Range,
null);
if (value == null)
value = string.Empty;
if (value is string)
return new object[,] { { value } };
else if (value is double)
return new object[,] { { value } };
else
{
object[,] range = (object[,])value;
int rows = range.GetLength(0);
int columns = range.GetLength(1);
object[,] param = new object[rows, columns];
Array.Copy(range, param, rows * columns);
return param;
}
}
else
throw new ArgumentException("Not A Excel Range Com Object");
}
Usage
public object[,] RemoveZeros(object range)
{
return this.RemoveZeros(range.ToObjectArray());
}
[ComVisible(false)]
[UdfMethod(IsVolatile = false)]
public object[,] RemoveZeros(Object[,] range)
{...}
The first function is COM-visible and will accept an Excel range or a chained call from another function (the chained call will return a 1-based object array); the second is UDF-enabled for Excel Services in SharePoint. All of the logic is in the second function. In this example we are just reformatting a range to replace zero with string.Empty.
You could use a 3rd party Excel compatible component such as SpreadsheetGear for .NET which has .NET friendly APIs - including 0 based indexing for APIs such as IRange[int rowIndex, int colIndex].
Such components will also be much faster than the Excel API in almost all cases.
Disclaimer: I own SpreadsheetGear LLC