Reading a full table from Excel using Open XML ...FASTER

Warning: long post due to examples and results
There have been threads here about how to read an Open XML spreadsheet row with empty cells between columns. I drew some of my answers from one of them, "reading Excel Open XML is ignoring blank cells".
I'm able to read a table from xlsx just fine, but it is 10 times slower than reading from CSV, even though the Open XML structure should(?) yield superior results.
Here is the code base I use for testing:
foreach (Row r in sheetData.Descendants<Row>())
{
sw.Start();
//find a row marked as "header" and get list of columns that define width of table
if (!headerRowFound)
{
headerRowFound = CheckOXMLHeaderRow(r, workbookPart, out headerReferences);
if (!headerRowFound)
continue;
}
rowKey++;
//////////////////////////////////////////////////////////////////////
///////////////////here we are going to do work//////////////////////
////////////////////////////////////////////////////////////////////
AddRow(rowKey, cols);
sw.Stop();
Debug.WriteLine("XLSX Row added in \t" + sw.ElapsedTicks.ToString() + "\tticks");
sw.Reset();
}
In my data a row has 68 cells, with only 5-10 of them filled in.
0. For comparison, going through CSV rows takes about 300 ticks (lightning fast). 5,000 rows are added in 3 ms.
1. The code as is, which only walks the row enumerator, takes 1-4 ticks per row.
2. This code simply grabs all cells sequentially and stores them in a row (the column order is wrong because Open XML does not store empty cells):
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, c);
}
// This takes about 8-10 times longer (10-30 ticks); still lightning fast.
3. If we know where to look based on the column (header) name and row number, we can do this:
Hashtable cols = new Hashtable();
foreach (string column in headerReferences.Values)
{
colKey++;
cols.Add(colKey, GetCellValue(workbookPart, worksheetPart, column + r.RowIndex.ToString()));
}
This is one of the MSDN examples, and it takes a whopping 500,000 ticks per row. It took several minutes to parse a 5,000-row spreadsheet. Not acceptable.
Here we were targeting EVERY cell in a row, existing or not.
4. I decided to scale back and retrieve the values of only the cells that are present, out of order, into a Hashtable:
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, GetValueFromCell(c, workbookPart));
}
This is now 500-1,500 ticks per row. Still lightning fast if we just store the values without any order (not a solution yet).
5. To preserve the column order, I make an empty clone of the header row for every new row and, after parsing the EXISTING cells, decide where to put each value based on a Hashtable lookup:
Hashtable cols = (Hashtable)emptyNewRow.Clone();
foreach (Cell c in r.Descendants<Cell>())
{
colKey = headerReferences[GetColumnName(c.CellReference)]; //what # column is this?
cols[colKey] = GetValueFromCell(c, workbookPart); //put value in that column
}
The final result is 9,000-20,000 ticks per row, or 30 s for a 5,000-row spreadsheet. Doable, but not ideal.
Here's where I stopped. Any ideas how to make it faster? How can humongous xlsx spreadsheets load so fast when the best I can do here is 30 s for 5k rows?
Dictionaries didn't do anything for me, not even a 1% improvement, and I need the result in Hashtables anyway for a legacy retrofit.
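One plausible reason approach 3 is so slow: the GetCellValue method in the appendix calls wsPart.Worksheet.Descendants<Cell>() and filters it by address on every lookup, so each lookup may walk the whole sheet. A minimal sketch of the alternative is to build a lookup from cell reference to value once and query that. The tuples below are hypothetical stand-ins for Open XML cells so that the example runs without the SDK; the names FindByScan and FindByIndex are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the cells of a worksheet: (reference, text).
var cells = new List<(string Reference, string Text)>
{
    ("A1", "Name"), ("C1", "Qty"),
    ("A2", "Widget"), ("C2", "7"),
};

// Slow pattern: every lookup scans the whole list, much like a
// Descendants<Cell>().Where(...).FirstOrDefault() call per cell.
static string FindByScan(List<(string Reference, string Text)> all, string address) =>
    all.Where(c => c.Reference == address).Select(c => c.Text).FirstOrDefault() ?? string.Empty;

// Faster pattern: index the cells once, then each lookup is a hash probe.
var byReference = cells.ToDictionary(c => c.Reference, c => c.Text);

static string FindByIndex(Dictionary<string, string> index, string address) =>
    index.TryGetValue(address, out var text) ? text : string.Empty;

Console.WriteLine(FindByScan(cells, "C2"));        // 7
Console.WriteLine(FindByIndex(byReference, "C2")); // 7
Console.WriteLine(FindByIndex(byReference, "B2") == string.Empty); // True (missing cell)
```

This only addresses the per-cell scan in approach 3; it says nothing about the cost of the shared-string and style lookups used in approaches 4 and 5.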
Appendix: referenced methods
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
public static string GetValueFromCell(Cell cell, WorkbookPart workbookPart)
{
int id;
string cellValue = cell.InnerText;
if (cellValue.Trim().Length > 0)
{
if (cell.DataType != null)
{
switch (cell.DataType.Value)
{
case CellValues.SharedString:
Int32.TryParse(cellValue, out id);
SharedStringItem item = GetSharedStringItemById(workbookPart, id);
if (item.Text != null)
{
cellValue = item.Text.Text;
}
else if (item.InnerText != null)
{
cellValue = item.InnerText;
}
else if (item.InnerXml != null)
{
cellValue = item.InnerXml;
}
break;
case CellValues.Boolean:
switch (cellValue)
{
case "0":
cellValue = "FALSE";
break;
default:
cellValue = "TRUE";
break;
}
break;
}
}
else
{
int excelDate;
if (Int32.TryParse(cellValue, out excelDate))
{
var styleIndex = (int)cell.StyleIndex.Value;
var cellFormats = workbookPart.WorkbookStylesPart.Stylesheet.CellFormats;
var numberingFormats = workbookPart.WorkbookStylesPart.Stylesheet.NumberingFormats;
var cellFormat = (CellFormat)cellFormats.ElementAt(styleIndex);
if (cellFormat.NumberFormatId != null)
{
var numberFormatId = cellFormat.NumberFormatId.Value;
var numberingFormat = numberingFormats.Cast<NumberingFormat>().SingleOrDefault(f => f.NumberFormatId.Value == numberFormatId);
if (numberingFormat != null && numberingFormat.FormatCode.Value.Contains("/yy")) //TODO here i should think of locales
{
DateTime dt = DateTime.FromOADate(excelDate);
cellValue = dt.ToString("MM/dd/yyyy");
}
}
}
}
}
return cellValue;
}
public static string GetCellValue(WorkbookPart wbPart, WorksheetPart wsPart, string addressName)
{
string value = String.Empty; //code from microsoft prefers null, but null is tough to work with
// Use its Worksheet property to get a reference to the cell
// whose address matches the address you supplied.
Cell theCell = wsPart.Worksheet.Descendants<Cell>().
Where(c => c.CellReference == addressName).FirstOrDefault();
// If the cell does not exist, return an empty string.
if (theCell != null)
{
value = theCell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (theCell.DataType != null)
{
switch (theCell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the shared strings table.
var stringTable = wbPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
// If the shared string table is missing, something is wrong. Return the index that is in the cell.
//Otherwise, look up the correct text in the table.
if (stringTable != null)
{
value = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
}
return value;
}
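GetColumnName above constructs a new Regex on every call. A possibly cheaper variant, shown as a sketch and not measured against the data in the question, takes the leading letters of the reference directly. The name GetColumnNameFast is made up for this example, and it assumes plain A1-style references.

```csharp
using System;

// Returns the leading ASCII letters of a cell reference, e.g. "BC12" -> "BC".
// Assumes plain A1-style references without '$' or sheet prefixes.
static string GetColumnNameFast(string cellReference)
{
    int i = 0;
    while (i < cellReference.Length &&
           ((cellReference[i] >= 'A' && cellReference[i] <= 'Z') ||
            (cellReference[i] >= 'a' && cellReference[i] <= 'z')))
    {
        i++;
    }
    return cellReference.Substring(0, i);
}

Console.WriteLine(GetColumnNameFast("A1"));    // A
Console.WriteLine(GetColumnNameFast("BC500")); // BC
```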

Related

Find linked formula values from worksheets and replace with actual cell value

In an OOXML spreadsheet (.xlsx) you can use a linking formula to fetch values from another spreadsheet and show them in your worksheet as values that are updated whenever the values in the other spreadsheet change.
I am using the Open XML SDK and I basically want to do what this does: https://www.e-iceblue.com/Tutorials/Spire.XLS/Spire.XLS-Program-Guide/Formula/Remove-Formulas-from-Cells-but-Keep-Values-in-Excel-in-C.html
How do I:
Find a value whose formula links to a cell in another spreadsheet
Replace the formula with the actual cell value
Do this for each cell in each worksheet in a spreadsheet
I have tried this so far: https://learn.microsoft.com/en-us/office/open-xml/how-to-retrieve-the-values-of-cells-in-a-spreadsheet
But I am receiving a NullReferenceException each time the cell does not contain a formula or any value at all. I have tried try-catch and several other ways to avoid this exception, but nothing works.
But back to the challenge as outlined above: can anyone help me out?
I already know how to do the basics, such as using directives, foreach loops, Open() and Save().

This worked for me:
public void Remove_CellReferences(string filepath)
{
using (SpreadsheetDocument spreadsheet = SpreadsheetDocument.Open(filepath, true))
{
// Delete all cell references in worksheet
List<WorksheetPart> worksheetparts = spreadsheet.WorkbookPart.WorksheetParts.ToList();
foreach (WorksheetPart part in worksheetparts)
{
Worksheet worksheet = part.Worksheet;
var rows = worksheet.GetFirstChild<SheetData>().Elements<Row>(); // Find all rows
foreach (var row in rows)
{
var cells = row.Elements<Cell>();
foreach (Cell cell in cells)
{
if (cell.CellFormula != null)
{
string formula = cell.CellFormula.InnerText;
if (formula.Length > 0)
{
string hit = formula.Substring(0, 1); // Transfer first 1 characters to string
if (hit == "[")
{
CellValue cellvalue = cell.CellValue; // Save current cell value
cell.CellFormula = null; // Remove the external-link formula
// If cellvalue is missing or does not have a real value
if (cellvalue == null || cellvalue.Text == "#N/A")
{
cell.DataType = CellValues.String;
cell.CellValue = new CellValue("Invalid data removed");
}
else
{
cell.CellValue = cellvalue; // Insert saved cell value
}
}
}
}
}
}
}
// Delete all external link references
List<ExternalWorkbookPart> extwbParts = spreadsheet.WorkbookPart.ExternalWorkbookParts.ToList();
if (extwbParts.Count > 0)
{
foreach (ExternalWorkbookPart extpart in extwbParts)
{
var elements = extpart.ExternalLink.ChildElements.ToList();
foreach (var element in elements)
{
if (element.LocalName == "externalBook")
{
spreadsheet.WorkbookPart.DeletePart(extpart);
}
}
}
}
// Delete calculation chain (a workbook may not have one)
CalculationChainPart calc = spreadsheet.WorkbookPart.CalculationChainPart;
if (calc != null)
{
spreadsheet.WorkbookPart.DeletePart(calc);
}
}
}
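The code above treats a formula as an external link only when it starts with "[". In the stored XML, references to another workbook are written with a numeric index in square brackets, for example [1]Sheet1!A1, and that marker can also appear after a quote or inside a function call. A hedged sketch of a broader check follows. The helper name HasExternalReference is invented here, and the pattern is an assumption that may also match a structured table reference whose column name consists of digits only.

```csharp
using System;
using System.Text.RegularExpressions;

// Matches an external-workbook index such as [1] or [12] anywhere in a formula.
var externalIndex = new Regex(@"\[\d+\]", RegexOptions.Compiled);

// True when the formula text contains a bracketed workbook index.
bool HasExternalReference(string formula) =>
    !string.IsNullOrEmpty(formula) && externalIndex.IsMatch(formula);

Console.WriteLine(HasExternalReference("[1]Sheet1!A1"));             // True
Console.WriteLine(HasExternalReference("SUM('[2]My Sheet'!A1:A3)")); // True
Console.WriteLine(HasExternalReference("SUM(A1:A3)"));               // False
```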

Reading cell value with applied Text formatting with OpenXML

I am trying to read an Excel sheet that contains cells with Text formatting.
One of the columns has values such as 1, 1.1, 1.2, etc.
In Excel all of these values look fine in cells with Text formatting: 1, 1.1, 1.2.
But when I read those cells with OpenXML, I get the values 1, 1.1000000000000001, 1.2; some of them have extra decimal digits.
I checked xl\worksheets\sheet1.xml in the *.xlsx file and saw that it really does contain the value 1.1000000000000001:
<row r="3" spans="1:20" ht="15" x14ac:dyDescent="0.25">
<c r="A3" s="2">
<v>1.1000000000000001</v>
</c>
My code is:
List<List<string>> rows = new List<List<string>>();
List<string> cols;
spreadsheetDocument = SpreadsheetDocument.Open(excelFilePath, false);
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
SharedStringTablePart sstpart = workbookPart.GetPartsOfType<SharedStringTablePart>().First();
SharedStringTable sst = sstpart.SharedStringTable;
foreach (Row r in sheetData.Elements<Row>())
{
cols = new List<string>();
foreach (Cell c in r.Elements<Cell>())
{
if (c.DataType != null && c.DataType == CellValues.SharedString)
{
int ssid = int.Parse(c.CellValue.Text);
string str = sst.ChildElements[ssid].InnerText;
cols.Add(str);
}
else
{
cols.Add(c.CellValue?.InnerText);
}
}
rows.Add(cols);
}
spreadsheetDocument.Close();
How can I get the correct value from such cells? For example, 1.1 rather than 1.1000000000000001.

First create this method to get the value of a cell.
public static string GetCellValue(WorkbookPart workbookPart, Cell cell)
{
string value = null;
if (cell.InnerText.Length > 0)
{
value = cell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (cell.DataType != null)
{
switch (cell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the
// shared strings table.
var stringTable =
workbookPart.GetPartsOfType<SharedStringTablePart>()
.FirstOrDefault();
// If the shared string table is missing, something
// is wrong. Return the index that is in
// the cell. Otherwise, look up the correct text in
// the table.
if (stringTable != null)
{
value =
stringTable.SharedStringTable
.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
value = value switch
{
"0" => "FALSE",
_ => "TRUE",
};
break;
}
}
}
return value;
}
Then use this code for your inner for loop:
foreach (Cell c in r.Elements<Cell>())
{
string str = GetCellValue(workbookPart, c);
cols.Add(str);
}
The GetCellValue method is inspired by the Microsoft documentation.
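Note that GetCellValue returns the text stored in the file, so a cell stored as 1.1000000000000001 would still come back with all 17 digits. One way to get 1.1, offered as a sketch and not as part of the answer above, is to parse numeric text as a double and format it with 15 significant digits, which is the precision Excel works with. The helper name NormalizeNumber is made up, and the choice of "G15" is an assumption; very large or very small values would come out in scientific notation.

```csharp
using System;
using System.Globalization;

// Re-formats numeric text with 15 significant digits;
// text that does not parse as a number is returned unchanged.
static string NormalizeNumber(string raw)
{
    if (double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out double number))
    {
        return number.ToString("G15", CultureInfo.InvariantCulture);
    }
    return raw;
}

Console.WriteLine(NormalizeNumber("1.1000000000000001")); // 1.1
Console.WriteLine(NormalizeNumber("1.2"));                // 1.2
Console.WriteLine(NormalizeNumber("abc"));                // abc
```

It is probably safest to apply this only to cells whose DataType is null or Number, so that shared strings which merely look numeric are left alone.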

Skip columns while reading big excel file using Open Xml Sax approach

I am reading several medium-sized Excel files, approximately 50 columns x 500 rows. The problem is that some formatting has been dragged out to column XFD, which is column number 16384 = 2^16/4 if my math is correct. With OleDb this does not cause any problems, because the following query lets me select only a subset of the spreadsheet without a huge performance loss caused by the remaining columns:
SELECT * FROM [SheetName$A1:BC500]
This takes around 3 seconds. The problem with OleDb is that it requires Windows and a file on disk; both cause trouble with our cloud infrastructure, so we would like to use OpenXml instead. OpenXml can be used with a DOM approach or a SAX approach. The first is a showstopper, because calling the Worksheet getter on WorksheetPart loads the whole XML with all columns, taking around 10 seconds.
Using the SAX approach to navigate the XML gives me the 5 methods for navigating an OpenXmlReader: LoadCurrentElement, Read, ReadFirstChild, ReadNextSibling and Skip. Using these I can:
use Read until I hit the Row elements
use ReadFirstChild to reach the first Cell element and ReadNextSibling to read the remaining ones, loading them with LoadCurrentElement until column BC
use ReadNextSibling until the whole Row is read (ignoring content, i.e. no call to LoadCurrentElement)
The performance loss is in the last step. How can I make the reader jump to the next row without looping through all the cells?
I think the key might be to use Skip to skip over all children. The problem is that I need to be at the Row element to skip all Cell elements, and there is no way to "rewind".
Here is an example I made to illustrate the problem. The Excel file is simply filled with x in the range A1:XFD500. And here are the measurements of while-time and load-time:
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;
using System.Diagnostics;
using System.Text.RegularExpressions;
using (var file = File.Open("testfile.xlsx", FileMode.Open, FileAccess.Read))
{
var doc = SpreadsheetDocument.Open(file, false);
var workbookPart = doc.WorkbookPart;
var sheet = doc
.WorkbookPart
.Workbook
.Descendants<Sheet>()
.First(s => s.Name == "sheetName");
var worksheetPart = (WorksheetPart)doc.WorkbookPart.GetPartById(sheet.Id);
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();
var rows = new List<List<object>>();
int i = 0;
foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
{
sharedStringCache.Add(i++, el.InnerText);
}
TimeSpan whileTime = TimeSpan.Zero;
TimeSpan loadTime = TimeSpan.Zero;
var stopwatch1 = new Stopwatch();
var stopwatch2 = new Stopwatch();
int lastColumnWithData = 50;
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
reader.ReadFirstChild();
List<object> cells = new List<object>();
do
{
if (reader.ElementType == typeof(Cell))
{
stopwatch2.Restart();
Cell c = (Cell)reader.LoadCurrentElement();
loadTime += stopwatch2.Elapsed;
var columnLetters = Regex.Replace(c.CellReference, @"[\d]", string.Empty).ToUpper();
var columnIndex = NumberFromExcelColumn(columnLetters);
var rowIndex = int.Parse(Regex.Replace(c.CellReference, @"[^\d]", string.Empty).ToUpper());
if (columnIndex > lastColumnWithData)
{
stopwatch1.Restart();
while (reader.ReadNextSibling()) {}
whileTime += stopwatch1.Elapsed;
break;
}
object value;
switch (c.DataType?.Value)
{
case CellValues.Boolean:
value = c.CellValue.InnerText == "1"; // booleans are stored as 0/1, which bool.Parse rejects
break;
case CellValues.Date:
value = DateTime.Parse(c.CellValue.InnerText);
break;
case CellValues.Number:
value = double.Parse(c.CellValue.InnerText);
break;
case CellValues.InlineString:
case CellValues.String:
value = c.CellValue.InnerText;
break;
case CellValues.SharedString:
value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
break;
default:
value = c.CellValue.InnerText;
break;
}
if (value != null)
cells.Add(value);
}
} while (reader.ReadNextSibling());
if (cells.Any())
rows.Add(cells);
}
}
}
static int NumberFromExcelColumn(string column)
{
int retVal = 0;
string col = column.ToUpper();
for (int iChar = col.Length - 1; iChar >= 0; iChar--)
{
char colPiece = col[iChar];
int colNum = colPiece - 64;
retVal = retVal + colNum * (int)Math.Pow(26, col.Length - (iChar + 1));
}
return retVal;
}
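The column-letter conversion above goes through Math.Pow and floating point. An equivalent integer-only form, shown as a sketch with the made-up name ColumnNumber, accumulates the value from left to right in base 26:

```csharp
using System;

// Converts Excel column letters to a 1-based column number using integer math.
// Assumes the input contains only the letters A-Z (either case).
static int ColumnNumber(string letters)
{
    int number = 0;
    foreach (char ch in letters.ToUpperInvariant())
    {
        number = number * 26 + (ch - 'A' + 1);
    }
    return number;
}

Console.WriteLine(ColumnNumber("A"));   // 1
Console.WriteLine(ColumnNumber("BC"));  // 55
Console.WriteLine(ColumnNumber("XFD")); // 16384
```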
made using examples from:
How to open a huge excel file efficiently
Fastest function to generate Excel column letters in C#

The Skip() function skips the child elements of the current node.
It can be used when the currently loaded element is the parent Row.
For the example above it would be simpler to leave the do-while loop when the column count is reached. Leaving the loop stops processing the remaining siblings; the outer while (reader.Read()) loop then advances until it reaches the next element of type Row.
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
reader.ReadFirstChild();
int lastColumnWithData = 50;
int columnCount = 0;
do
{
if (reader.ElementType == typeof(Cell))
{
columnCount += 1;
}
} while (reader.ReadNextSibling() && columnCount <= lastColumnWithData);
}
}
Or, to move to the last child element before exiting the iteration, try the following:
do
{
if (reader.ElementType == typeof(Cell))
{
columnCount += 1;
if (columnCount > lastColumnWithData)
{
// skip to last element
while (reader.ReadNextSibling() && !reader.IsEndElement)
{
};
break;
}
}
} while (reader.ReadNextSibling());

In ClosedXML, is there any way to get the column letter from column header name?

I have an Excel worksheet that has column headers, and I don't want to hard-code the column letter or index, so I am trying to figure out how to make it dynamic. I am looking for something like this:
var ws = wb.Worksheet("SheetName");
var range = ws.RangeUsed();
var table = range.AsTable();
string colLetter = table.GetColumnLetter("ColHeader");
foreach (var row in table.Rows())
{
if (i > 1)
{
string val = row.Cell(colLetter).Value.ToString();
}
i++;
}
Does ClosedXML support anything like the made-up GetColumnLetter() function above, so I don't have to hard-code column letters?

Sure, get the cell you want using a predicate on the CellsUsed collection on the row with the headers, then return the column letter from the column.
public string GetColumnName(IXLTable table, string columnHeader)
{
var cell = table.HeadersRow().CellsUsed(c => c.Value.ToString() == columnHeader).FirstOrDefault();
if (cell != null)
{
return cell.WorksheetColumn().ColumnLetter();
}
return null;
}

For version 0.95.4.0 I used the following steps:
var ws = wb.Worksheet("SheetName");
var range = ws.RangeUsed();
var table = range.AsTable();
var cell = table.FindColumn(c => c.FirstCell().Value.ToString() == yourColumnName);
if (cell != null)
{
var columnLetter = cell.RangeAddress.FirstAddress.ColumnLetter;
}
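If only the header's 1-based position is known (for example from a loop over the header row), the letter can also be computed directly. This is a generic sketch that does not use ClosedXML; the name ColumnLetter is invented for the example.

```csharp
using System;

// Converts a 1-based column number to Excel letters (1 -> A, 27 -> AA).
static string ColumnLetter(int columnNumber)
{
    if (columnNumber < 1) throw new ArgumentOutOfRangeException(nameof(columnNumber));
    string letters = string.Empty;
    while (columnNumber > 0)
    {
        // Work in "bijective" base 26: shift by one so that 1..26 map to A..Z.
        int remainder = (columnNumber - 1) % 26;
        letters = (char)('A' + remainder) + letters;
        columnNumber = (columnNumber - 1) / 26;
    }
    return letters;
}

Console.WriteLine(ColumnLetter(1));     // A
Console.WriteLine(ColumnLetter(27));    // AA
Console.WriteLine(ColumnLetter(16384)); // XFD
```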

EPPlus: How to traverse through every content block (cell or merged range) of a worksheet?

I'm trying to traverse the cells and merged cells inside a worksheet and replace some template texts with dynamic values. So far, however, I haven't managed to loop through all non-empty cells. My current code, shown below, throws a NullReferenceException when it tries to access the Text property of a merged cell.
I'm using a template file and copying the template worksheet from there into my constructed workbook. I also tried it with a single workbook without the copy; it gave the same result.
I also tried adding a Where(cell => !cell.Merge) filter to the first foreach loop, with the same result.
using (var p = new ExcelPackage(new FileInfo(templateFile)))
{
var ws = _excel.Workbook.Worksheets.Add("Report", p.Workbook.Worksheets[sablonMunkafuzet]);
foreach (ExcelRangeBase cell in ws.Cells)
{
if (string.IsNullOrEmpty(cell.Text)) continue;
var s = cell.Text;
if (s.StartsWith("^^"))
ProcessCell(cell, s.Substring(2));
}
foreach (string mc in ws.MergedCells)
{
var s = ws.Cells[mc].Text;
if (s.StartsWith("^^"))
ProcessCell(ws.Cells[mc], s.Substring(2));
}
}
EDIT: I would like to achieve what I would do manually if I opened that template in Excel: find every "block" (meaning an individual cell or a merged cell range) where a specific text pattern appears, process it, and change the value of that "block" to what I calculate.

It seems I've found a solution for my scenario; here it is in case anyone needs such code in the future.
The following piece of code loops through every block (individual cell or merged range) in a worksheet and does some processing. In this case I examine the text to see whether it contains a replaceable formula. The ^^ is my special marker indicating that the block contains a template definition that should be replaced with my runtime data.
var ws = _excel.Workbook.Worksheets["myTemplateWorksheet"];
var dim = ws.Dimension;
// first loop through all non-merged cells
for (int r = dim.Start.Row; r <= dim.End.Row; ++r)
for (int c = dim.Start.Column; c <= dim.End.Column; ++c)
{
if (ws.Cells[r, c].Merge) continue;
string s = GetRangeText(ws.Cells[r,c]);
if (string.IsNullOrEmpty(s)) continue;
if (s.StartsWith("^^"))
ProcessCell(ws.Cells[r, c], s.Substring(2));
}
// then loop through all merged ranges
foreach (string mc in ws.MergedCells)
{
string s = GetRangeText(ws.Cells[mc]);
if (string.IsNullOrEmpty(s)) continue;
if (s.StartsWith("^^"))
ProcessCell(ws.Cells[mc], s.Substring(2));
}
This uses the following helper method, which extracts the text from a range, taking the array representation of merged ranges into account:
private string GetRangeText(ExcelRangeBase range)
{
var val = range.Value;
string s = val as string;
if (string.IsNullOrEmpty(s))
{
object[,] arr = val as object[,];
if (arr != null && arr.GetLength(0) > 0 && arr.GetLength(1) > 0)
s = arr[0, 0] as string;
}
if (string.IsNullOrEmpty(s) && val != null)
s = val.ToString();
return s;
}
