Reading very large excel file - c#

I am using this article to read a very large Excel file with the SAX approach:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
I can't store the values in a DataTable or in memory because the client machine does not have enough memory, so I am trying to read the values and store them in a database right away:
    // The SAX approach.
    static void ReadExcelFileSAX(string fileName)
    {
        using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
        {
            WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
            WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
            OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
            string text;
            while (reader.Read())
            {
                if (reader.ElementType == typeof(CellValue))
                {
                    text = reader.GetText();
                    Console.Write(text + " ");
                }
            }
            Console.WriteLine();
            Console.ReadKey();
        }
    }
For example when I read this excel file:
Test 1
22
345345
345345435
2333
333333
4444
4444444
324324
99999
I get this output:
Blank
22
Blank
345345
Blank
etc
I have no idea where the blanks are coming from. I tried adding an if statement that tests for blanks, but then I miss the last value, 99999.
The reader seems so limited. I would really appreciate any suggestion at all!

The OpenXmlReader treats the start and end elements as independent items. They can be told apart by checking the IsStartElement and IsEndElement properties.
Your blank values come from the end elements, for which GetText returns an empty string.
You have two options to fix it. First, you could check for IsStartElement in your loop:
    while (reader.Read())
    {
        if (reader.ElementType == typeof(CellValue)
            && reader.IsStartElement)
        {
            text = reader.GetText();
            Console.WriteLine(text + " ");
        }
    }
Alternatively, you can use the LoadCurrentElement method to load the whole element, consuming both the start and the end element you were getting before:
    while (reader.Read())
    {
        if (reader.ElementType == typeof(CellValue))
        {
            CellValue cellVal = (CellValue)reader.LoadCurrentElement();
            Console.WriteLine(cellVal.Text);
        }
    }

Related

How to get a merged cell from Excel using DocumentFormat.OpenXML or ClosedXML C#

I need to get the area of a merged cell and the row number on which that area ends in Excel, using only DocumentFormat.OpenXml or ClosedXML. How can I do this for each cell?
Using ClosedXML, this could be done with:
    var ws = workbook.Worksheet("Sheet 1");
    var cell = ws.Cell("A2");
    var mergedRange = cell.MergedRange();
    var lastCell = mergedRange.LastCell();
    // or
    var lastCellAddress = mergedRange.RangeAddress.LastAddress;
I found this to be a little clunky but I believe this to be the correct approach.
    private static void GetMergedCells()
    {
        var fileName = "c:\\temp\\Data.xlsm";
        // Open the document.
        using (SpreadsheetDocument document = SpreadsheetDocument.Open(fileName, false))
        {
            // Get the WorkbookPart object.
            var workbookPart = document.WorkbookPart;
            // Get the first worksheet in the document. You can change this as need be.
            var worksheet = workbookPart.Workbook.Descendants<Sheet>().FirstOrDefault();
            // Retrieve the WorksheetPart using the Part ID from the previous "Sheet" object.
            var worksheetPart = (WorksheetPart)workbookPart.GetPartById(worksheet.Id);
            // Retrieve the MergeCells element, this will contain all MergeCell elements.
            var mergeCellsList = worksheetPart.Worksheet.Elements<MergeCells>();
            // Now loop through and spit out each range reference for the merged cells.
            // You'll need to process the range either as a string or turn it into another
            // object that gives you the end row.
            foreach (var mergeCells in mergeCellsList)
            {
                foreach (MergeCell mergeCell in mergeCells)
                {
                    Console.WriteLine(mergeCell.Reference);
                }
            }
        }
    }
If you couldn't already tell, this is using DocumentFormat.OpenXml.Spreadsheet
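The step of processing the range as a string to get the end row could be sketched roughly as follows. This is only an illustration: MergeRangeHelper and GetRowSpan are hypothetical names, not part of the Open XML SDK or ClosedXML, and the sketch assumes A1-style references such as "B2:D5" (optionally with "$" markers).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the Open XML SDK): parses a merge
// reference such as "B2:D5" into its first and last row numbers.
public static class MergeRangeHelper
{
    // Optional "$", column letters, optional "$", then the row digits.
    private static readonly Regex CellPattern =
        new Regex(@"^\$?([A-Za-z]+)\$?([0-9]+)$", RegexOptions.Compiled);

    public static (uint FirstRow, uint LastRow) GetRowSpan(string reference)
    {
        if (string.IsNullOrWhiteSpace(reference))
            throw new ArgumentException("Reference is empty.", nameof(reference));

        // A merge reference has the form "first:last"; a single cell has no colon.
        string[] parts = reference.Split(':');
        uint first = ParseRow(parts[0]);
        uint last = parts.Length > 1 ? ParseRow(parts[1]) : first;
        return (first, last);
    }

    private static uint ParseRow(string cell)
    {
        Match m = CellPattern.Match(cell.Trim());
        if (!m.Success)
            throw new FormatException($"Not an A1-style cell reference: '{cell}'");
        return uint.Parse(m.Groups[2].Value);
    }
}
```

Inside the inner loop above, something like MergeRangeHelper.GetRowSpan(mergeCell.Reference.Value).LastRow would then give the row on which the merged area ends.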

Open XML Removal of MS Word Table Rows Corrupting Images

I am trying to remove some rows from a table in an MS Word document. This is how the table looks before processing:
I analyzed this table to understand its Open XML representation. This is how the InnerText property is composed:
Items
Description
null
Classroom
empty
Interactive Classroom...
empty
empty
Case Study Classrooms ...
empty
empty
Auditoria Lecture Classrooms ...
Computers
empty
Mainframe Computer...
empty
empty
Supercomputer...
empty
empty
Workstation Computer...
The empty middle column is where the image is inserted. The image and the description are in two different cells, with an invisible border between them.
Below is the code to remove items "Case Study Classrooms", "Supercomputer", "Workstation Computer","Personal Computer" and "Tablet".
    var itemsToBeExcluded = new List<string>{"Case Study Classrooms", "Supercomputer", "Workstation Computer","Personal Computer","Tablet"};
    using (MemoryStream stream = new MemoryStream())
    {
        //pageData is a byte[] to represent the word file
        stream.Write(pageData, 0, (int)pageData.Length);
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
        {
            var table = wordDoc.MainDocumentPart.Document.Body.OfType<Table>().FirstOrDefault();
            int rowCount = 0;
            string firstColumnInnerXml = string.Empty;
            for (int t = 0; t<table.ChildElements.Count; t++)
            {
                if(table.ChildElements[t] is TableRow)
                {
                    // Skip the header
                    if (rowCount++ != 0)
                    {
                        // Gets the inner xml of first column of the table and set if it is null for the subsequent rows
                        if (table.ChildElements[t].ChildElements[1].InnerText.Length > 0)
                        {
                            firstColumnInnerXml = table.ChildElements[t].ChildElements[1].InnerXml;
                        }
                        else
                        {
                            table.ChildElements[t].ChildElements[1].InnerXml = firstColumnInnerXml;
                        }
                        foreach (var removableItem in itemsToBeExcluded)
                        {
                            if (table.ChildElements[t].ChildElements[3].InnerText.ToLower().StartsWith(removableItem.ToLower()))
                            {
                                table.ChildElements[t].Remove();
                                t--;
                                goto OUTERCONTINUE;
                            }
                        }
                        OUTERCONTINUE:;
                    }
                }
            }
            wordDoc.MainDocumentPart.Document.Save();
            wordDoc.Close();
        }
    }
However, after execution, this is what I get:
The image is obviously missing. Even though I am only removing the necessary rows, the images in the unrelated rows also seem to be corrupted or removed. Can someone explain why this happens and how to solve it?

Read from Excel file — specific sheet

Simply put: I need to read from an xlsx file (rows), in the simplest way. That is, preferably without using third-party tools, or at least things that aren't available as nuget packages.
I've been trying for a while with IExcelDataReader, but I cannot figure out how to get data from a specific sheet.
This simple code snippet works, but it just reads the first worksheet:
    FileStream stream = File.Open("C:\\test\\test.xlsx", FileMode.Open, FileAccess.Read);
    IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
    excelReader.IsFirstRowAsColumnNames = true;
    while (excelReader.Read()) {
        Console.WriteLine(excelReader.GetString(0));
    }
This prints the rows in the first worksheet, but ignores the others. Of course, there is nothing to suggest otherwise, but I cannot seem to find out how to specify the sheet name.
It strikes me that this should be quite easy?
Sorry for asking something that has been asked several times before, but the answers (here and elsewhere on the net) are a jungle of bad, plain wrong, and outdated half-answers that are a nightmare to make sense of, especially since almost everyone answering assumes that you know some specific details that are not always easy to find.
UPDATE:
As per daniell89's suggestion below, I've tried this:
    FileStream stream = File.Open("C:\\test\\test.xlsx", FileMode.Open, FileAccess.Read);
    IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
    excelReader.IsFirstRowAsColumnNames = true;
    // Select the first or second sheet - this works:
    DataTable specificWorkSheet = excelReader.AsDataSet().Tables[1];
    // This works: Printing the first value in each column
    foreach (var col in specificWorkSheet.Columns)
        Console.WriteLine(col.ToString());
    // This does NOT work: Printing the first value in each row
    foreach (var row in specificWorkSheet.Rows)
        Console.WriteLine(row.ToString());
Printing each column heading with col.ToString() works fine.
Printing the first cell of each row with row.ToString() results in this output:
System.Data.DataRow
System.Data.DataRow
System.Data.DataRow
...
One per row, so it's obviously getting the rows. But how to get the contents, and why does ToString() work for the columns and not for the rows?
Maybe look at this answer: https://stackoverflow.com/a/32522041/5358389
    DataSet workSheets= reader.AsDataSet();
And then specific sheet:
    DataTable specificWorkSheet = reader.AsDataSet().Tables[yourValue];
Enumerating rows:
    foreach (var row in specificWorkSheet.Rows)
        Console.WriteLine(((DataRow)row)[0]); // column identifier in square brackets
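On why ToString() behaves differently for the two: DataColumn overrides ToString() to return the column name, while DataRow does not override it, so the default implementation just prints the type name. A minimal sketch of reading the first cell of each row follows; DataRowDemo and FirstCells are hypothetical names used only for illustration and are not part of ExcelDataReader.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper for illustration only.
public static class DataRowDemo
{
    // Returns the first cell of every row, converted to a string.
    public static List<string> FirstCells(DataTable table)
    {
        var values = new List<string>();
        // Declaring the loop variable as DataRow makes foreach cast each item.
        // With "var" the item is typed as object, because DataRowCollection
        // only implements the non-generic IEnumerable.
        foreach (DataRow row in table.Rows)
        {
            // The indexer takes a column index or a column name.
            values.Add(Convert.ToString(row[0]));
        }
        return values;
    }
}
```

The same indexer also accepts a column name, for example row["Name"], which can be easier to read once the header row has been used for column names.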
You need to get the Worksheet for the sheet you want to read data from. To get range A1 from Cars, for example:
    var app = new Application();
    Workbooks workbooks = app.Workbooks;
    Workbook workbook = workbooks.Open(@"C:\MSFT Site Account Updates_May 2015.xlsx");
    Worksheet sheet = workbook.Sheets["Cars"];
    Range range = sheet.Range["A1"];
This is a late reply, but I hope it helps someone.
The code below retrieves the data from the first sheet and also reads the values of the first (header) row.
    if (upload != null && upload.ContentLength > 0)
    {
        // ExcelDataReader works with the binary Excel file, so it needs a FileStream
        // to get started. This is how we avoid dependencies on ACE or Interop:
        Stream stream = upload.InputStream;
        // We return the interface, so that
        IExcelDataReader reader = null;
        if (upload.FileName.EndsWith(".xls"))
        {
            reader = ExcelReaderFactory.CreateBinaryReader(stream);
        }
        else if (upload.FileName.EndsWith(".xlsx"))
        {
            reader = ExcelReaderFactory.CreateOpenXmlReader(stream);
        }
        else
        {
            ModelState.AddModelError("File", "This file format is not supported");
            return View();
        }
        var result = reader.AsDataSet(new ExcelDataSetConfiguration()
        {
            ConfigureDataTable = (_) => new ExcelDataTableConfiguration()
            {
                UseHeaderRow = true
            }
        }).Tables[0];// get the first sheet data with index 0
        var tables = result.Rows[0].Table.Columns;//we have to get a list of table headers here "first row" from 1 row
        foreach(var rue in tables)// iterate through the header list and add it to variable 'Headers'
        {
            Headers.Add(rue.ToString());//Headers has been treated as a global variable "private List<string> Headers = new List<string>();"
        }
        var count = Headers.Count();// test if the headers have been added using count
        reader.Close();
        return View(result);
    }
    else
    {
        ModelState.AddModelError("File", "Please Upload Your file");
    }

OpenXML overwrites formatted cell in template

I have a formatted template stored in the database.
After building and opening the Excel file, the cell has the format, but the value is not displayed the way it should be.
Example: in the template the field looks like 1234.56$, but now it looks like 1234.56, so the $ is missing.
Second example: in the template it looks like 12%, but now it looks like 11.9999999997%.
The values I insert are exact values, such as 1234.56 and 11.9999999997%. If I type them manually into the generated Excel file, the formatting works, but not when they are inserted during creation.
Does anyone have any ideas?
My insert statement:
    public static void InsertRows(List<ExcelRow> rowDefinitions, Stream template, string sheetName)
    {
        using (SpreadsheetDocument doc = SpreadsheetDocument.Open(template, true))
        {
            // tell Excel to recalculate formulas next time it opens the doc
            doc.WorkbookPart.Workbook.CalculationProperties.ForceFullCalculation = true;
            doc.WorkbookPart.Workbook.CalculationProperties.FullCalculationOnLoad = true;
            foreach (var rd in rowDefinitions)
            {
                // first get the context (WS + SheetData)
                var ws = GetWorksheetPart(doc.WorkbookPart, sheetName);
                var sheetData = ws.Worksheet.Descendants<SheetData>().First();
                var nr = CreateRow((uint)rd.RowIndex, sheetData);
                foreach (var cd in rd.Cells)
                {
                    var c = EnsureCell(nr, cd.ColumnName);
                    SetCellValue(cd.CellText, c, doc.WorkbookPart.SharedStringTablePart);
                }
            }
            doc.WorkbookPart.Workbook.Save();
        }
    }

OpenXML WorksheetParts in reverse?

I have the following code:
    SpreadsheetDocument doc = SpreadsheetDocument.Open(name, true);
    foreach (WorksheetPart wsP in doc.WorkbookPart.WorksheetParts)
    {
        SheetData sData = wsP.Worksheet.Descendants<SheetData>().First();
        var cells = sData.Descendants<Cell>().Where(c => c.CellReference.Value == "A1");
        if (cells.Count<Cell>() == 1)
        {
            int index = Convert.ToInt32(cells.First().CellValue.Text);
            SharedStringItem str = doc.WorkbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(index);
        }
    }
    Sheet sheet = doc.WorkbookPart.Workbook.Sheets.FirstChild as Sheet;
    doc.Close();
So far, I'm exploring OpenXML and just seeing how things work. I've come upon what seems to be an odd behavior. The file I am opening contains three sheets:
"Instructions", whose A1 cell is "fck ="
"Geometry", whose A1 cell is "PullDate"
"Stresses", whose A1 cell is "Section"
If I tinker with the latter part of the code (Sheet) to go to different sheets (by appending a .NextSibling at the end), that is precisely the order of sheet titles I get (by using the debugger).
However, the foreach WorksheetPart section on top goes backwards, returning me the A1 cells as "Section" and then "PullDate" and then "fck =".
Is this the expected behavior or am I doing something foolishly wrong and not noticing?
