Is there a SAX way to loop through OpenXML rows? - c#

I'm parsing a large file using the SAX approach offered on:
Parsing and Reading Large Excel Files with the Open XML SDK
This is my modified version (only getting the row number for simplicity)
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open("BigFile.xlsx", true))
{
WorkbookPart workbookPart = myDoc.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
String rowNum;
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
if (reader.HasAttributes)
rowNum = reader.Attributes.First(a => a.LocalName == "r").Value
}
}
}
The problem is that this loops through every item/cell/column/whatnot and only acts when the element type is Row.
Is there a SAX way to loop only through the rows and not every item in the worksheet?
Thanks,

The key is to use the Skip() and ReadNextSibling() methods of the reader...
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open("BigFile.xlsx", true))
{
WorkbookPart workbookPart = myDoc.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
String rowNum;
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
do
{
if (reader.HasAttributes)
rowNum = reader.Attributes.First(a => a.LocalName == "r").Value;
} while (reader.ReadNextSibling()); // Skip to the next row
break; // We just looped through all the rows so no need to continue reading the worksheet
}
if (reader.ElementType != typeof(Worksheet)) // Dont' want to skip the contents of the worksheet
reader.Skip(); // Skip contents of any node before finding the first row.
}
}

Related

How to count the rows per sheet using OpenXML

The code below only count the total number of rows in sheet 1 but I also want to count the rows in sheet 2 and sheet 3 and so on
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open("PATH", true))
{
//Get workbookpart
WorkbookPart workbookPart = myDoc.WorkbookPart;
//then access to the worksheet part
IEnumerable<WorksheetPart> worksheetPart = workbookPart.WorksheetParts;
foreach (WorksheetPart WSP in worksheetPart)
{
//find sheet data
IEnumerable<SheetData> sheetData = WSP.Worksheet.Elements<SheetData>();
// Iterate through every sheet inside Excel sheet
foreach (SheetData SD in sheetData)
{
IEnumerable<Row> row = SD.Elements<Row>(); // Get the row IEnumerator
Console.WriteLine(row.Count()); // Will give you the count of rows
}
}
}

Read excel by sheet name with OpenXML

I am new at OpenXML c# and I want to read rows from excel file. But I need to read excel sheet by name. this is my sample code that reads first sheet:
using (var spreadSheet = SpreadsheetDocument.Open(path, true))
{
WorkbookPart workbookPart = spreadSheet.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
foreach (Row r in sheetData.Elements<Row>())
{
foreach (Cell c in r.Elements<Cell>())
{
if (c.DataType != null && c.DataType == CellValues.SharedString)
{
// reading cells
}
}
}
But how can I find by sheet name and read cells.
I've done it like in the code snippet below. It's basically Workbook->Spreadsheet->Sheet then getting the Name attribute of the sheet.
The basic underling xml looks like this:
<x:workbook>
<x:sheets>
<x:sheet name="Sheet1" sheetId="1" r:id="rId1" />
<x:sheet name="TEST sheet Name" sheetId="2" r:id="rId2" />
</x:sheets>
</x:workbook>
The id value is what the Open XML package uses internally to identify each sheet and link it with the other XML parts. That's why the line of code that follows identifying the name uses GetPartById to pick up the WorksheetPart.
using (SpreadsheetDocument doc = SpreadsheetDocument.Open(path, false))
{
WorkbookPart bkPart = doc.WorkbookPart;
DocumentFormat.OpenXml.Spreadsheet.Workbook workbook = bkPart.Workbook;
DocumentFormat.OpenXml.Spreadsheet.Sheet s = workbook.Descendants<DocumentFormat.OpenXml.Spreadsheet.Sheet>().Where(sht => sht.Name == "Sheet1").FirstOrDefault();
WorksheetPart wsPart = (WorksheetPart)bkPart.GetPartById(s.Id);
DocumentFormat.OpenXml.Spreadsheet.SheetData sheetdata = wsPart.Worksheet.Elements<DocumentFormat.OpenXml.Spreadsheet.SheetData>().FirstOrDefault();
foreach (DocumentFormat.OpenXml.Spreadsheet.Row r in sheetdata.Elements<DocumentFormat.OpenXml.Spreadsheet.Row>())
{
DocumentFormat.OpenXml.Spreadsheet.Cell c = r.Elements<DocumentFormat.OpenXml.Spreadsheet.Cell>().First();
txt += c.CellValue.Text + Environment.NewLine;
}
this.txtMessages.Text += txt;
}

DocumentFormat.openxml Excel File Reading Issue

I have used DocumentFormat.OpenXml dll in one of my project for reading and writing excel file.
During Reading of Excel File, Let's say for some column say Column1 I am having cell values as "TRUE" and "FALSE". When I read this Excel File using Following Code
private SharedStringTable sharedStringTable;
using (SpreadsheetDocument doc = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = doc.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.FirstOrDefault();
SharedStringTablePart sharedStringTablePart = workbookPart.SharedStringTablePart;
if (sharedStringTablePart != null)
{
sharedStringTable = sharedStringTablePart.SharedStringTable;
}
var sheets = workbookPart.Workbook.Sheets;
foreach (Sheet sheet in sheets)
{
Worksheet requiredItem = (doc.WorkbookPart.GetPartById(sheet.Id.Value) as WorksheetPart).Worksheet;
var sheetData = requiredItem.Elements<SheetData>().First();
foreach (var rowItem in sheetData.Elements<Row>())
{
foreach (var item in rowItem.Elements<Cell>())
{
string requiredText = string.empty;
if (item.CellValue != null)
{
requiredText = item.CellValue.InnerText;
}
}
}
}
}
At that time for Cell Values "TRUE" and "FALSE" i am getting values 1 and 0 Respectively.
Can anyone provide me any way so that I can get values "TRUE" and "FALSE" instead of 1 and 0 ?

How to count rows per worksheet in OpenXML

I switched from Interop library to OpenXML, because I need to read large Excel files. Before that I could use:
worksheet.UsedRange.Rows.Count
to get the number of rows with data on the worksheet. I used this information to make a progressbar. In OpenXML I do not know how to get the same information about the worksheet. What I have now is this code:
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(path, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
int row_count = 0, col_count;
// here I would like to get the info about the number of rows
foreach (Row r in sheetData.Elements<Row>())
{
col_count = 0;
if (row_count > 10)
{
foreach (Cell c in r.Elements<Cell>())
{
// do some stuff
// update progressbar
}
}
row_count++;
}
}
It's not that hard (When you use LINQ),
using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open("PATH", true))
{
//Get workbookpart
WorkbookPart workbookPart = myDoc.WorkbookPart;
//then access to the worksheet part
IEnumerable<WorksheetPart> worksheetPart = workbookPart.WorksheetParts;
foreach (WorksheetPart WSP in worksheetPart)
{
//find sheet data
IEnumerable<SheetData> sheetData = WSP.Worksheet.Elements<SheetData>();
// Iterate through every sheet inside Excel sheet
foreach (SheetData SD in sheetData)
{
IEnumerable<Row> row = SD.Elements<Row>(); // Get the row IEnumerator
Console.WriteLine(row.Count()); // Will give you the count of rows
}
}
}
Edited with Linq now it's straight forward.

OpenXML, SAX, and Simply Reading an Xlsx file

I have been struggling to find a solution on how to read a large xlsx file with OpenXml. I have tried the microsoft samples without luck. I simply need to read an excel file into a DataTable in c#. I am not concerned with value types in the datatable, everything can be stored as a string values.
The samples I have found so far don't retain the structure of the spreadsheet and only return the values of the cells.
Any ideas?
The open xml SDK can be a little hard to understand. However, I have found it useful to use http://simpleooxml.codeplex.com/ this code plex project. It adds a thin layer over the sdk to more easily parse through excel files and work with styles.
Then you can use something like the following with their worksheet reader to recurse through and grab the values you want
System.IO.MemoryStream ms = Utility.StreamToMemory(xslxTemplate);
using (SpreadsheetDocument document = SpreadsheetDocument.Open(ms, true))
{
IEnumerable<Sheet> sheets = document.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
if (sheets.Count() == 0)
{
// The specified worksheet does not exist.
return null;
}
string relationshipId = sheets.First().Id.Value;
WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(relationshipId);
string myval =WorksheetReader.GetCell("A", 0, worksheetPart).CellValue.InnerText;
// Put in a loop to go through contents of document
}
You can get DataTable this way:
using (SpreadsheetDocument spreadsheet = SpreadsheetDocument.Open(fileName, false))
{
DataTable data = ToDataTable(spreadsheet, "Employees");
}
This method will read Excel sheet data as DataTable
public DataTable ToDataTable(SpreadsheetDocument spreadsheet, string worksheetName)
{
var workbookPart = spreadsheet.WorkbookPart;
var sheet = workbookPart
.Workbook
.Descendants<Sheet>()
.FirstOrDefault(s => s.Name == worksheetName);
var worksheetPart = sheet == null
? null
: workbookPart.GetPartById(sheet.Id) as WorksheetPart;
var dataTable = new DataTable();
if (worksheetPart != null)
{
var sheetData = worksheetPart.Worksheet.GetFirstChild<SheetData>();
foreach (Row row in sheetData.Descendants<Row>())
{
var values = row
.Descendants<Cell>()
.Select(cell =>
{
var value = cell.CellValue.InnerXml;
if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
{
value = workbookPart
.SharedStringTablePart
.SharedStringTable
.ChildElements[int.Parse(value)]
.InnerText;
}
return (object)value;
})
.ToArray();
dataTable.Rows.Add(values);
}
}
return dataTable;
}

Categories