OpenXML does not help to read large Excel files contrary to documentation - c#

The documentation says that:
The following code segment is used to read a very large Excel
file using the DOM approach.
and then gives an example. I used it to read a relatively large file with 700K rows. This is the code I have so far:
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(path, false))
{
    WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
    WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
    SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
    // no other code
}
When I start my program, it runs out of memory (> 1 GB) within about five seconds, and the debugger points to this line of code:
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).
EDIT
One more mysterious thing. The code I have so far:
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
while (reader.Read())
{
    if (reader.ElementType == typeof(Row))
    {
        count++;
    }
}
gives me more than a million rows in the count variable. However, I have only 14K rows on the first sheet and 700K on the second, which is very strange. So my extra question is how to parse only rows with data using the SAX approach.
And one final mystery of reading large Excel files with OpenXML. Someone in this thread says: "Turns out that the worksheets are enumerated backwards for some reason (so the first of my three sheets is actually index 3". So my final extra question is how to get the sheet you want. At the moment I use this code:
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
But taking into account what they say, I'm not sure that in my case I would actually get the first worksheet.

You seem to have a few questions; I'll try to tackle them one by one.
So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).
Yes, the OpenXml SDK is great for reading large files, but you may need to use a SAX approach rather than a DOM approach. From the same documentation you cite:
However, the DOM approach requires loading entire Open XML parts into memory, which can cause an Out of Memory exception when you are working with really large files.... Consider using SAX when you need to handle very large files.
The DOM approach loads the whole sheet into memory, which for a large sheet can cause out-of-memory exceptions. With the SAX approach you read each element in turn, which reduces memory consumption considerably.
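As a rough illustration of the SAX approach, here is a minimal sketch that streams cell values with OpenXmlReader instead of loading SheetData. It assumes the DocumentFormat.OpenXml package; the class and method names (SaxCellReader, ReadCellValues) are made up for this example.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public static class SaxCellReader
{
    // Streams the raw text of every CellValue (<v>) element in one worksheet.
    // Shared strings come back as their index into the shared string table;
    // resolving them is left out of this sketch.
    public static IEnumerable<string> ReadCellValues(WorksheetPart worksheetPart)
    {
        using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
        {
            while (reader.Read())
            {
                // Act on the start tag only; the reader may also stop on the end tag.
                if (reader.ElementType == typeof(CellValue) && reader.IsStartElement)
                {
                    yield return reader.GetText();
                }
            }
        }
    }
}
```

Because the values are yielded one at a time, the caller can process them in a foreach loop without the whole sheet ever being held in memory.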
So, my extra question is how to parse only rows with data using SAX approach
The SDK only gives you the rows that have data (or at least the rows that exist in the XML). You appear to have asked this as a separate question, which I've answered in more detail, but essentially the code in your question sees both the start and the end of each row element, so each row is counted twice. See my answer to your "Why does OpenXML read rows twice" question for more details.
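A minimal sketch of a row counter that only counts the start of each row element, assuming the DocumentFormat.OpenXml package; RowCounter and CountRows are hypothetical names used for illustration.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public static class RowCounter
{
    // Counts the <row> elements in a worksheet part without loading the DOM.
    public static int CountRows(WorksheetPart worksheetPart)
    {
        int count = 0;
        using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
        {
            while (reader.Read())
            {
                // A row with children is visited at its start tag and again at
                // its end tag, so count the start tag only.
                if (reader.ElementType == typeof(Row) && reader.IsStartElement)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```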
So, my final extra question is how to get the sheet you want.
You need to find the Sheet by name which is a descendant of the Workbook. Once you have that you can use its Id to get the WorksheetPart:
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filename, false))
{
    WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
    // FirstOrDefault returns null when no sheet has that name; First would throw instead.
    Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == sheetName);
    if (sheet != null)
    {
        WorksheetPart worksheetPart = workbookPart.GetPartById(sheet.Id) as WorksheetPart;
        // read worksheetPart...
    }
}

Related

What is the easiest way to count .xlsx workbook sheets using c#, NPOI and an XSSF workbook?

I am trying to count the number of sheets in a workbook. The workbook is created using NPOI, and there doesn't seem to be a way to count the sheets using the C# version of NPOI.
This is a really tricky thing to both explain and show... But I will give it a try.
What I am trying to do is use an existing Excel file as a template for statistics. This file can contain a varying number of template sheets, and I need to count them to know where to place my new sheets and how to edit their names.
The sender of the data only has to choose which template sheet should be filled with which data, and I will then remove the template sheets from the workbook after all data has been inserted.
What I have tried:
I have read the documentation and searched for information and have tried the following approaches:
getNumberOfSheets - How to know number of sheets in a workbook?
Problem with this approach: The C# version of NPOI doesn't seem to have getNumberOfSheets.
Convert found row-counters into sheet-counters - NPOI - Get excel row count to check if it is empty
I can't really adapt that code to sheets, as the functionality for sheets and rows is too different.
var sheetIndex = 0;
foreach (var sheet in requestBody.Sheets)
{
    if (sheet.TemplateNumber == "")
    {
        sheetTemplate = templateWorkbook.CreateSheet(sheet.Name);
    }
    else
    {
        sheetTemplate = templateWorkbook.CloneSheet(Convert.ToInt32(sheet.SheetTemplate));
        if (!templates.Contains(Convert.ToInt32(sheet.SheetTemplate)))
        {
            templates.Add(Convert.ToInt32(sheet.SheetTemplate));
        }
        // Do the math to make sure we add the name to the newly created sheet further down the code (I need the actual index here)
    }
    // Insert statistics
    // After inserting statistics:
    workingCopy.SetSheetName(sheetIndex, sheet.Name);
    foreach (var template in templates)
    {
        workingCopy.RemoveSheetAt(template);
    }
}
You can get the number of sheets from the NumberOfSheets property of the XSSFWorkbook class.
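A minimal sketch of using that property, assuming the NPOI package; the SheetInfo class and GetSheetNames method are hypothetical names for this example. It walks the indexes from 0 to NumberOfSheets - 1 and collects the sheet names.

```csharp
using System.Collections.Generic;
using NPOI.SS.UserModel;

public static class SheetInfo
{
    // Returns the sheet names in workbook order; the list length equals NumberOfSheets.
    public static List<string> GetSheetNames(IWorkbook workbook)
    {
        var names = new List<string>();
        for (int i = 0; i < workbook.NumberOfSheets; i++)
        {
            names.Add(workbook.GetSheetName(i));
        }
        return names;
    }
}
```

Since a cloned sheet is normally appended at the end of the workbook, reading NumberOfSheets right after cloning is one way to work out the index of the new sheet; treat that as an assumption to verify against your NPOI version.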

Skip a sheet in an Excel workbook using OpenXML (C#)

I am iterating through all sheets in a workbook using OpenXML (C#), but I want to skip a specific sheet based on its name. Please suggest how this can be done.
The sheet names are stored in the WorkbookPart in a Sheets element
which has children of element Sheet which corresponds to each
worksheet in the Excel file.
See the answer given in this question.
Make your own loop and skip the sheet according to its name.
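Such a loop could look roughly like the sketch below. It assumes the DocumentFormat.OpenXml package; SheetFilter, GetWorksheetsExcept and sheetNameToSkip are hypothetical names, and the case-insensitive name comparison is a choice made for this example.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public static class SheetFilter
{
    // Yields every worksheet in workbook order except the one with the given name.
    public static IEnumerable<(string Name, WorksheetPart Part)> GetWorksheetsExcept(
        SpreadsheetDocument document, string sheetNameToSkip)
    {
        WorkbookPart workbookPart = document.WorkbookPart;
        foreach (Sheet sheet in workbookPart.Workbook.Descendants<Sheet>())
        {
            string name = sheet.Name?.Value;
            if (string.Equals(name, sheetNameToSkip, StringComparison.OrdinalIgnoreCase))
            {
                continue; // skip the unwanted sheet
            }

            // Chart sheets and similar are not WorksheetParts, so skip those too.
            if (workbookPart.GetPartById(sheet.Id.Value) is WorksheetPart part)
            {
                yield return (name, part);
            }
        }
    }
}
```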

Import a second spread sheet into Microsoft.Office.Interop.Excel C# project

I'm stuck on the last hurdle to finish my program. I have an Excel document that I want to import into the one I'm building in C#:
wb.Sheets.Add();
Microsoft.Office.Interop.Excel.Worksheet staffCosts = (Microsoft.Office.Interop.Excel.Worksheet)wb.Worksheets[1];
staffCosts.Name = "Staff Costs";
staffCosts.QueryTables[1].Name = Path.GetFileNameWithoutExtension("C:\\tilldataoutput\\excelcreator\\excelcreator\\bin\\Debug\\Staff.xlsx");
Any help would be massively appreciated.
Take a look at this MSDN link, which states...
Returns the QueryTables collection that represents all the query
tables on the specified worksheet. Read-only.
Since you're getting a QueryTable by index, you should first check that the collection contains at least that many items (for example via its Count property). Note that Excel Interop collections are 1-based, so QueryTables[1] is the first QueryTable, not the second.

How to get the 'first' sheet in OOXML with C# and the SDK?

SO! :) Simple question -- it's probably been asked, but I could not find it.
I am retrieving data from an XLSX using the Open XML SDK and C#.
I want to get the "first" sheet (as in the first one you would see in Excel), but when I use...
WorkbookPart wbPart = workBook.WorkbookPart;
//Now let's find the dimension of the first worksheet
string sheetArea = wbPart.WorksheetParts.First().Worksheet.SheetDimension.Reference.Value;
Unfortunately, in a brand-new XLSX this pulled "Sheet3" instead of "Sheet1". I do not know the sheet name ahead of time nor can I force the user to submit a workbook with only one sheet or specify sheet name. My present requirements are to grab the first sheet.
Can someone please help? :)
EDIT: I figured it out! But I can't answer my own question for 7 hours, so...
I found this by digging through answers on this other SO question:
Open XML SDK 2.0 - how to update a cell in a spreadsheet?
In essence, a working example might be this :
(wbPart.GetPartById(wbPart.Workbook.Sheets.Elements<Sheet>().First().Id.Value) as WorksheetPart).Worksheet.SheetDimension.Reference.Value
As far as I know, something like:
Sheet firstSheet = wbPart.Workbook.Descendants<Sheet>().First();
Worksheet firstWorksheet = ((WorksheetPart)wbPart.GetPartById(firstSheet.Id)).Worksheet;
should return the first worksheet. The workbook's Sheet descendants should always be in the order in which they appear in the workbook, at least in my experience.
If you want the first visible sheet, use:
// The state attribute is usually omitted for visible sheets, so a null State also counts as visible.
Sheet firstSheet = wbPart.Workbook.Descendants<Sheet>()
    .First(s => s.State == null || s.State.Value == SheetStateValues.Visible);

Read excel sheet data in columns using OpenXML

Is there a way to read an Excel sheet column-wise rather than row-wise using the OpenXML SDK and C#?
I have already tried the EPPlus package but ran into problems because my application also uses ".xlsm" files, which are not supported by EPPlus. So I need a solution in OpenXML for reading data in columns.
If anyone has an example, that would help.
Thanks
Sri
WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(sheets.First().Id);
// Get the cells in the specified column and order them by row.
IEnumerable<Cell> cells = worksheetPart.Worksheet.Descendants<Cell>()
    .Where(c => string.Compare(GetColumnName(c.CellReference.Value),
        columnName, true) == 0)
    .OrderBy(r => GetRowIndex(r.CellReference));
foreach (var cell in cells)
{
    // Process each cell of the column here.
}
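The snippet above calls GetColumnName and GetRowIndex without defining them. A minimal sketch of what such helpers could look like, assuming cell references in the usual A1 style (for example "AB12"); the class name CellReferenceHelper is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class CellReferenceHelper
{
    // Returns the column letters of a cell reference, e.g. "AB" for "AB12".
    public static string GetColumnName(string cellReference)
    {
        Match match = Regex.Match(cellReference, "[A-Za-z]+");
        return match.Value;
    }

    // Returns the row number of a cell reference, e.g. 12 for "AB12".
    public static uint GetRowIndex(string cellReference)
    {
        Match match = Regex.Match(cellReference, @"\d+");
        return uint.Parse(match.Value);
    }
}
```

With these in scope, columnName in the query would be a column letter such as "B", and the comparison against it is case-insensitive.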
