Reading cell value with applied Text formatting with OpenXML - c#

I am trying to read an Excel sheet that contains cells with Text formatting.
One of the columns has the values 1, 1.1, 1.2, etc.
In Excel all of these values look fine in cells with Text formatting: 1, 1.1, 1.2.
But when I read those cells with OpenXML, I get the values 1, 1.1000000000000001, 1.2; some of them have extra decimal places.
I checked xl\worksheets\sheet1.xml inside the *.xlsx file, and it really does contain the value 1.1000000000000001:
<row r="3" spans="1:20" ht="15" x14ac:dyDescent="0.25">
<c r="A3" s="2">
<v>1.1000000000000001</v>
</c>
My code is:
List<List<string>> rows = new List<List<string>>();
List<string> cols;
spreadsheetDocument = SpreadsheetDocument.Open(excelFilePath, false);
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
SharedStringTablePart sstpart = workbookPart.GetPartsOfType<SharedStringTablePart>().First();
SharedStringTable sst = sstpart.SharedStringTable;
foreach (Row r in sheetData.Elements<Row>())
{
cols = new List<string>();
foreach (Cell c in r.Elements<Cell>())
{
if (c.DataType != null && c.DataType == CellValues.SharedString)
{
int ssid = int.Parse(c.CellValue.Text);
string str = sst.ChildElements[ssid].InnerText;
cols.Add(str);
}
else
{
cols.Add(c.CellValue?.InnerText);
}
}
rows.Add(cols);
}
spreadsheetDocument.Close();
How can I get the correct value from such cells? For example, 1.1 rather than 1.1000000000000001.

First create this method to get the value of a cell.
public static string GetCellValue(WorkbookPart workbookPart, Cell cell)
{
string value = null;
if (cell.InnerText.Length > 0)
{
value = cell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (cell.DataType != null)
{
switch (cell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the
// shared strings table.
var stringTable =
workbookPart.GetPartsOfType<SharedStringTablePart>()
.FirstOrDefault();
// If the shared string table is missing, something
// is wrong. Return the index that is in
// the cell. Otherwise, look up the correct text in
// the table.
if (stringTable != null)
{
value =
stringTable.SharedStringTable
.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
value = value switch
{
"0" => "FALSE",
_ => "TRUE",
};
break;
}
}
}
return value;
}
Then use this code for your inner foreach loop:
foreach (Cell c in r.Elements<Cell>())
{
string str = GetCellValue(workbookPart, c);
cols.Add(str);
}
The GetCellValue method is inspired by Microsoft documentation on this page
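Note that a method like this returns the stored text unchanged, so a cell stored as 1.1000000000000001 still comes back that way. That text is the full round-trip form of the double closest to 1.1, while Excel itself displays at most 15 significant digits. Here is a minimal sketch of one workaround, assuming the cell is numeric: parse the text and re-format it with 15 significant digits. The class and method names (NumberText.Normalize) are hypothetical and not part of the Open XML SDK.

```csharp
using System;
using System.Globalization;

public static class NumberText
{
    // Hypothetical helper: re-formats numeric cell text with 15 significant
    // digits, the precision Excel displays. Non-numeric text is returned as-is.
    public static string Normalize(string raw)
    {
        if (double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out double number))
        {
            // "G15" drops the trailing noise digits of the round-trip form.
            return number.ToString("G15", CultureInfo.InvariantCulture);
        }
        return raw;
    }
}
```

Apply it only to cells whose DataType is null or Number; a shared string that merely looks like a number should be left alone. Very large or very small values may come back in exponent notation with the "G15" format.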

Related

Find linked formula values from worksheets and replace with actual cell value

In an OOXML spreadsheet (.xlsx) you can use a linking formula to fetch values from another spreadsheet and show them in your worksheet; they are updated whenever the values in the other spreadsheet change.
I am using Open Xml SDK and I basically want to do what this does: https://www.e-iceblue.com/Tutorials/Spire.XLS/Spire.XLS-Program-Guide/Formula/Remove-Formulas-from-Cells-but-Keep-Values-in-Excel-in-C.html
How do I:
Find a value that has formula linking value to a cell in another spreadsheet
Replace the formula value with the actual cell value
Do this foreach cell in each worksheet in a spreadsheet
I have tried this so far: https://learn.microsoft.com/en-us/office/open-xml/how-to-retrieve-the-values-of-cells-in-a-spreadsheet
But I get a NullReferenceException every time a cell does not contain a formula or any value. I have tried try-catch and several other ways to avoid this exception, but nothing works.
But back to the challenge as outlined above; can anyone help me out?
Basic stuff such as using SOME DIRECTIVE, foreach loop, Open(), Save() I know how to do.
This worked for me:
public void Remove_CellReferences(string filepath)
{
using (SpreadsheetDocument spreadsheet = SpreadsheetDocument.Open(filepath, true))
{
// Delete all cell references in worksheet
List<WorksheetPart> worksheetparts = spreadsheet.WorkbookPart.WorksheetParts.ToList();
foreach (WorksheetPart part in worksheetparts)
{
Worksheet worksheet = part.Worksheet;
var rows = worksheet.GetFirstChild<SheetData>().Elements<Row>(); // Find all rows
foreach (var row in rows)
{
var cells = row.Elements<Cell>();
foreach (Cell cell in cells)
{
if (cell.CellFormula != null)
{
string formula = cell.CellFormula.InnerText;
if (formula.Length > 0)
{
string hit = formula.Substring(0, 1); // Transfer first 1 characters to string
if (hit == "[")
{
CellValue cellvalue = cell.CellValue; // Save current cell value
cell.CellFormula = null; // Remove RTD formula
// If the cell has no cached value, or the cached value is an error
if (cellvalue == null || cellvalue.Text == "#N/A")
{
cell.DataType = CellValues.String;
cell.CellValue = new CellValue("Invalid data removed");
}
else
{
cell.CellValue = cellvalue; // Insert saved cell value
}
}
}
}
}
}
}
// Delete all external link references
List<ExternalWorkbookPart> extwbParts = spreadsheet.WorkbookPart.ExternalWorkbookParts.ToList();
if (extwbParts.Count > 0)
{
foreach (ExternalWorkbookPart extpart in extwbParts)
{
var elements = extpart.ExternalLink.ChildElements.ToList();
foreach (var element in elements)
{
if (element.LocalName == "externalBook")
{
spreadsheet.WorkbookPart.DeletePart(extpart);
}
}
}
}
// Delete calculation chain (the part is absent when the workbook has none)
CalculationChainPart calc = spreadsheet.WorkbookPart.CalculationChainPart;
if (calc != null)
{
spreadsheet.WorkbookPart.DeletePart(calc);
}
}
}

Getting incorrect cell value while parsing excel with OpenXML

I am trying to parse an Excel file and load the result into a DataTable using C# and OpenXML.
Below is my code snippet.
value = cell.CellValue.InnerText;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
return doc.WorkbookPart.SharedStringTablePart.SharedStringTable.ChildElements.GetItem(int.Parse(value)).InnerText;
}
return value;
But if the cell value is 80.3600, it is parsed as 80.36.
Also, if the value is 03-Jan-2018, it is parsed as 43103.
The problem is that the Excel file I am trying to parse is generated dynamically, so at run time I won't know which column is a date and which is numeric.
Is there any way to get the value exactly as displayed, or to get every value as a string, i.e. with no formatting applied?
I've noticed that numeric and date cells have different StyleIndex values.
StyleIndex is an index into Stylesheet.CellFormats; the NumberFormatId of that entry identifies the format code in doc.WorkbookPart.WorkbookStylesPart.Stylesheet.NumberingFormats.
var doc = SpreadsheetDocument.Open(File.Open("D:\\123.xlsx", FileMode.Open), false);
var sheet = doc.WorkbookPart.Workbook.Descendants<Sheet>().FirstOrDefault();
WorksheetPart wsPart = (WorksheetPart)(doc.WorkbookPart.GetPartById(sheet.Id));
var cells = wsPart.Worksheet.Descendants<Cell>().ToList();
var stylesheet = doc.WorkbookPart.WorkbookStylesPart.Stylesheet;
// NumberingFormats holds custom formats only and is absent when the workbook defines none
var numberingFormats = stylesheet.NumberingFormats?.Elements<NumberingFormat>().ToList() ?? new List<NumberingFormat>();
var stringTable = doc.WorkbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
foreach (var cell in cells)
{
if (cell.DataType == null)
{
// Date or number: resolve StyleIndex -> CellFormat -> NumberFormatId -> format code.
// Built-in formats have no NumberingFormat entry, so formatCode may stay null.
string formatCode = null;
if (cell.StyleIndex != null)
{
var cellFormat = (CellFormat)stylesheet.CellFormats.ElementAt((int)cell.StyleIndex.Value);
var numberFormatId = cellFormat.NumberFormatId?.Value;
formatCode = numberingFormats.FirstOrDefault(f => numberFormatId != null && f.NumberFormatId?.Value == numberFormatId)?.FormatCode?.Value;
}
if (formatCode == "[$-409]mmmm\\ d\\,\\ yyyy;#")
{
Console.WriteLine(DateTime.FromOADate(double.Parse(cell.InnerText)).ToString("MMMM dd,yyyy"));
}
else if (formatCode == "[$-409]dd\\-mmm\\-yy;#")
{
Console.WriteLine(DateTime.FromOADate(double.Parse(cell.InnerText)).ToString("dd-MMM-yy"));
}
else
{
//Numeric: print the stored text, because int.Parse throws on values such as 80.36
Console.WriteLine(cell.InnerText);
}
}
else if (cell.DataType.Value == CellValues.SharedString)
{
Console.WriteLine(stringTable.SharedStringTable.ElementAt(int.Parse(cell.InnerText)).InnerText);
}
}
See also: Excel Interop cell formatting of Dates
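The two conversions the question asks about can be sketched without the SDK: once you know a cell is a date or a fixed-decimal number, the stored text 43103 or 80.36 can be turned back into the displayed text. This is only a sketch; the class name RawCellText and both method names are hypothetical, and the .NET format strings passed in are an assumption, since Excel format codes are not the same syntax as .NET format strings and have to be mapped by hand.

```csharp
using System;
using System.Globalization;

public static class RawCellText
{
    // Converts an Excel date serial such as "43103" to display text.
    public static string FormatDate(string raw, string dotNetFormat)
    {
        double serial = double.Parse(raw, CultureInfo.InvariantCulture);
        return DateTime.FromOADate(serial).ToString(dotNetFormat, CultureInfo.InvariantCulture);
    }

    // Re-applies a fixed number of decimals to stored text such as "80.36".
    public static string FormatNumber(string raw, string dotNetFormat)
    {
        double number = double.Parse(raw, CultureInfo.InvariantCulture);
        return number.ToString(dotNetFormat, CultureInfo.InvariantCulture);
    }
}
```

For example, RawCellText.FormatDate("43103", "dd-MMM-yyyy") gives 03-Jan-2018 and RawCellText.FormatNumber("80.36", "0.0000") gives 80.3600. Parsing with the invariant culture matters because the file always stores numbers with a dot as the decimal separator, whatever the machine's locale.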

In ClosedXML, is there any way to get the column letter from column header name?

I have an excel worksheet that has column headers and I don't want to hard code the column letter or index so I am trying to figure out how I could make it dynamic. I am looking for something like this:
var ws = wb.Worksheet("SheetName");
var range = ws.RangeUsed();
var table = range.AsTable();
string colLetter = table.GetColumnLetter("ColHeader");
foreach (var row in table.Rows())
{
if (i > 1)
{
string val = row.Cell(colLetter).Value.ToString();
}
i++;
}
Does ClosedXML support anything like the made up GetColumnLetter() function above so I don't have to hard code column letters?
Sure, get the cell you want using a predicate on the CellsUsed collection on the row with the headers, then return the column letter from the column.
public string GetColumnName(IXLTable table, string columnHeader)
{
var cell = table.HeadersRow().CellsUsed(c => c.Value.ToString() == columnHeader).FirstOrDefault();
if (cell != null)
{
return cell.WorksheetColumn().ColumnLetter();
}
return null;
}
For version 0.95.4.0 I used the following steps:
var ws = wb.Worksheet("SheetName");
var range = ws.RangeUsed();
var table = range.AsTable();
var cell = table.FindColumn(c => c.FirstCell().Value.ToString() == yourColumnName);
if (cell != null)
{
var columnLetter = cell.RangeAddress.FirstAddress.ColumnLetter;
}

Reading a full table from Excel using Open XML ...FASTER

Warning: long post due to examples and results
There have been threads here about how to read an Open XML spreadsheet row with null cells in between columns. I drew some of my answers from "reading Excel Open XML is ignoring blank cells".
I'm able to read a table just fine from xlsx, but it is 10 times slower than reading from CSV, while the open XML structure should(?) yield superior results.
Here's what I got for testing code base:
foreach (Row r in sheetData.Descendants<Row>())
{
sw.Start();
//find a row marked as "header" and get list of columns that define width of table
if (!headerRowFound)
{
headerRowFound = CheckOXMLHeaderRow(r, workbookPart, out headerReferences);
if (!headerRowFound)
continue;
}
rowKey++;
//////////////////////////////////////////////////////////////////////
///////////////////here we are going to do work//////////////////////
////////////////////////////////////////////////////////////////////
AddRow(rowKey, cols);
sw.Stop();
Debug.WriteLine("XLSX Row added in \t" + sw.ElapsedTicks.ToString() + "\tticks");
sw.Reset();
}
In my data a row is 68 cells, with only 5-10 of them filled out
0. For comparison, going through CSV rows takes about 300 ticks (lightning fast). 5000 rows adds in 3ms
1. Code as is processes through row enumerators only in 1-4 ticks
2. This code simply grabs all cells sequentially and stores them in a row (column order is screwed up due to OXML nature)
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, c);
}
//this takes about 8-10 times longer - 10-30 ticks , still lightning fast
3. If we know where to look for based on column(header) name and row number, we can do this
Hashtable cols = new Hashtable();
foreach (string column in headerReferences.Values)
{
colKey++;
cols.Add(colKey, GetCellValue(workbookPart, worksheetPart, column + r.RowIndex.ToString()));
}
This is one of the MSDN examples and it's a whopping 500,000 ticks per row. It took several minutes to parse a 5,000-row spreadsheet. Not acceptable.
Here we were targeting EVERY cell in a row, existing or not
4. I decided to scale back and try to retrieve the value from all incoming cells, out of order, into a Hashtable
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, GetValueFromCell(c, workbookPart));
}
This is now 500-1,500 ticks per row. Still lightning fast if we just store the values without any order (not a solution yet)
5. To make sure I preserve the order of columns, I make an empty clone of the header row for every new row, and after I parse through the EXISTING cells, I decide where to put them based on Hashtable retrieval
Hashtable cols = (Hashtable)emptyNewRow.Clone();
foreach (Cell c in r.Descendants<Cell>())
{
colKey = headerReferences[GetColumnName(c.CellReference)]; //what # column is this?
cols[colKey] = GetValueFromCell(c, workbookPart); //put value in that column
}
The final result is 9,000-20,000 ticks per row, or 30 s for a 5,000-row spreadsheet. Doable, but not ideal.
Here's where I stopped. Any ideas how to make it faster? How can humongous xlsx spreadsheets load so lightning fast when the best I can do here is 30 s for 5k rows?
Dictionaries didn't do anything for me, not even a 1% improvement. And I need the result in Hashtables anyway for a legacy retrofit
Appendix: referenced methods
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
public static string GetValueFromCell(Cell cell, WorkbookPart workbookPart)
{
int id;
string cellValue = cell.InnerText;
if (cellValue.Trim().Length > 0)
{
if (cell.DataType != null)
{
switch (cell.DataType.Value)
{
case CellValues.SharedString:
Int32.TryParse(cellValue, out id);
SharedStringItem item = GetSharedStringItemById(workbookPart, id);
if (item.Text != null)
{
cellValue = item.Text.Text;
}
else if (item.InnerText != null)
{
cellValue = item.InnerText;
}
else if (item.InnerXml != null)
{
cellValue = item.InnerXml;
}
break;
case CellValues.Boolean:
switch (cellValue)
{
case "0":
cellValue = "FALSE";
break;
default:
cellValue = "TRUE";
break;
}
break;
}
}
else
{
int excelDate;
if (Int32.TryParse(cellValue, out excelDate))
{
var styleIndex = (int)cell.StyleIndex.Value;
var cellFormats = workbookPart.WorkbookStylesPart.Stylesheet.CellFormats;
var numberingFormats = workbookPart.WorkbookStylesPart.Stylesheet.NumberingFormats;
var cellFormat = (CellFormat)cellFormats.ElementAt(styleIndex);
if (cellFormat.NumberFormatId != null)
{
var numberFormatId = cellFormat.NumberFormatId.Value;
var numberingFormat = numberingFormats.Cast<NumberingFormat>().SingleOrDefault(f => f.NumberFormatId.Value == numberFormatId);
if (numberingFormat != null && numberingFormat.FormatCode.Value.Contains("/yy")) //TODO here i should think of locales
{
DateTime dt = DateTime.FromOADate(excelDate);
cellValue = dt.ToString("MM/dd/yyyy");
}
}
}
}
}
return cellValue;
}
public static string GetCellValue(WorkbookPart wbPart, WorksheetPart wsPart, string addressName)
{
string value = String.Empty; //code from microsoft prefers null, but null is tough to work with
// Use its Worksheet property to get a reference to the cell
// whose address matches the address you supplied.
Cell theCell = wsPart.Worksheet.Descendants<Cell>().
Where(c => c.CellReference == addressName).FirstOrDefault();
// If the cell does not exist, return an empty string.
if (theCell != null)
{
value = theCell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (theCell.DataType != null)
{
switch (theCell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the shared strings table.
var stringTable = wbPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
// If the shared string table is missing, something is wrong. Return the index that is in the cell.
//Otherwise, look up the correct text in the table.
if (stringTable != null)
{
value = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
}
return value;
}
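One thing that stands out in the appendix is that GetColumnName constructs a new Regex object on every call, and step 5 calls it once per cell. Whether that accounts for a measurable share of the 9,000-20,000 ticks was not profiled here, so treat the following as a sketch of one thing to try, not a confirmed fix. It extracts the column letters with a plain character scan; the class name CellRef is hypothetical.

```csharp
using System;

public static class CellRef
{
    // Returns the leading letters of a cell reference such as "AB12" ("AB").
    // Assumes the reference starts with the column letters, as the r attribute
    // of a cell does; absolute references like "$A$1" are not handled.
    public static string ColumnName(string cellReference)
    {
        int length = 0;
        while (length < cellReference.Length && char.IsLetter(cellReference[length]))
        {
            length++;
        }
        return cellReference.Substring(0, length);
    }
}
```

A smaller change with a similar aim is to keep the Regex in a static readonly field so it is built once. Separately, the GetCellValue overload that takes an address calls Descendants<Cell>() on the whole worksheet for every lookup, which explains why step 3 gets slower as the sheet grows.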

Get Column number from cell value, openxml

I have an Excel file in which I want to find a column number, as in the image below:
In the image above, I know that the header records appear in the 1st row, but I am unsure of the column number. In the example, the column header "Quantity" appears in "D1". I know the row number; how can I find the column ("D" in this case) using Open XML? The "Quantity" header might appear anywhere in the sheet, and I need to find the corresponding values for Quantity only.
Unfortunately there's not a single method you can call to find the correct cell. Instead you'll need to iterate over the cells to find the matching text. To complicate things slightly, the value in the cell is not always the actual text. Instead strings can be stored in the SharedStringTablePart and the value of the cell is an index into the contents of that table.
Something like the following should do what you're after:
private static string GetCellReference(string filename, string sheetName, int rowIndex, string textToFind)
{
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filename, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
//get the correct sheet
Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == sheetName);
if (sheet != null)
{
WorksheetPart worksheetPart = workbookPart.GetPartById(sheet.Id) as WorksheetPart;
SharedStringTablePart stringTable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
Row row = sheetData.Elements<Row>().FirstOrDefault(r => r.RowIndex == rowIndex);
if (row != null)
{
foreach (Cell c in row.Elements<Cell>())
{
string cellText;
if (c.DataType != null && c.DataType == CellValues.SharedString)
{
//the value will be a number which is an index into the shared strings table
int index = int.Parse(c.CellValue.InnerText);
cellText = stringTable.SharedStringTable.ElementAt(index).InnerText;
}
else
{
//just take the value from the cell (note this won't work for dates and other types)
cellText = c.CellValue?.InnerText;
}
if (cellText == textToFind)
{
return c.CellReference;
}
}
}
}
}
return null;
}
This can then be called like this:
string cellReference = GetCellReference(@"c:\temp\test.xlsx", "Sheet1", 1, "Quantity");
Console.WriteLine(cellReference); //prints D1 for your example
If you just want D rather than D1 you can use a simple regex to remove the numbers:
private static string GetColumnName(string cellReference)
{
if (cellReference == null)
return null;
return Regex.Replace(cellReference, "[0-9]", "");
}
And then use it like this:
string cellReference = GetCellReference(@"c:\temp\test.xlsx", "Sheet1", 1, "Quantity");
Console.WriteLine(GetColumnName(cellReference)); //prints D for your example
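Since the question asks for a column number, the letters can be converted with base-26 arithmetic in which A is 1 and Z is 26. A minimal sketch follows; the class and method names (ColumnNumber.FromLetters) are hypothetical and not part of the Open XML SDK.

```csharp
using System;

public static class ColumnNumber
{
    // Converts column letters to a 1-based column number: "A" is 1, "D" is 4.
    public static int FromLetters(string columnLetters)
    {
        int number = 0;
        foreach (char ch in columnLetters.ToUpperInvariant())
        {
            if (ch < 'A' || ch > 'Z')
            {
                throw new ArgumentException("Expected letters only, e.g. \"D\" or \"AA\".", nameof(columnLetters));
            }
            // Each letter is one base-26 digit with values 1..26.
            number = number * 26 + (ch - 'A' + 1);
        }
        return number;
    }
}
```

With the example above, ColumnNumber.FromLetters(GetColumnName(cellReference)) would give 4 for "D". GetColumnName returns null when the text is not found, so check for null before converting.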
