Read from Excel file — specific sheet - c#

Simply put: I need to read from an xlsx file (rows), in the simplest way. That is, preferably without using third-party tools, or at least things that aren't available as nuget packages.
I've been trying for a while with IExcelDataReader, but I cannot figure out how to get data from a specific sheet.
This simple code snippet works, but it just reads the first worksheet:
FileStream stream = File.Open("C:\\test\\test.xlsx", FileMode.Open, FileAccess.Read);
IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
excelReader.IsFirstRowAsColumnNames = true;
while (excelReader.Read()) {
Console.WriteLine(excelReader.GetString(0));
}
This prints the rows in the first worksheet, but ignores the others. Of course, there is nothing to suggest otherwise, but I cannot seem to find out how to specify the sheet name.
It strikes me that this should be quite easy?
Sorry for asking something which has been asked several times before, but the answers (here and elsewhere on the net) are a jungle of bad, plain wrong, and outdated half-answers that are a nightmare to make sense of, especially since almost everyone answering assumes that you know some specific details that are not always easy to find.
UPDATE:
As per daniell89's suggestion below, I've tried this:
FileStream stream = File.Open("C:\\test\\test.xlsx", FileMode.Open, FileAccess.Read);
IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
excelReader.IsFirstRowAsColumnNames = true;
// Select the first or second sheet - this works:
DataTable specificWorkSheet = excelReader.AsDataSet().Tables[1];
// This works: Printing the first value in each column
foreach (var col in specificWorkSheet.Columns)
Console.WriteLine(col.ToString());
// This does NOT work: Printing the first value in each row
foreach (var row in specificWorkSheet.Rows)
Console.WriteLine(row.ToString());
Printing each column heading with col.ToString() works fine.
Printing the first cell of each row with row.ToString() results in this output:
System.Data.DataRow
System.Data.DataRow
System.Data.DataRow
...
One per row, so it's obviously getting the rows. But how to get the contents, and why does ToString() work for the columns and not for the rows?

Maybe look at this answer: https://stackoverflow.com/a/32522041/5358389
DataSet workSheets = reader.AsDataSet();
And then pick the specific sheet (yourValue can be an index or a table name):
DataTable specificWorkSheet = workSheets.Tables[yourValue];
Enumerating rows:
foreach (var row in specificWorkSheet.Rows)
Console.WriteLine(((DataRow)row)[0]); // column identifier in square brackets
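To show why col.ToString() prints something useful while row.ToString() only prints the type name, here is a minimal, self-contained sketch that uses only System.Data. The table name "Sheet2" and the sample values are made up for illustration. It assumes that the tables returned by AsDataSet() are named after the worksheets, which is worth checking against your ExcelDataReader version.

```csharp
using System;
using System.Data;

public static class DataRowDemo
{
    // Builds a small in-memory stand-in for what AsDataSet() returns.
    // "Sheet2" and the sample values are hypothetical.
    public static DataSet BuildSample()
    {
        var sheet = new DataTable("Sheet2");
        sheet.Columns.Add("Name", typeof(string));
        sheet.Columns.Add("Qty", typeof(int));
        sheet.Rows.Add("apple", 3);
        sheet.Rows.Add("pear", 5);

        var workbook = new DataSet();
        workbook.Tables.Add(sheet);
        return workbook;
    }

    public static void Main()
    {
        // DataSet.Tables can be indexed by position or by table name.
        DataTable sheet = BuildSample().Tables["Sheet2"];

        // DataColumn overrides ToString() to return the column name.
        foreach (DataColumn col in sheet.Columns)
            Console.WriteLine(col);

        // DataRow does not override ToString(), so index into the row instead.
        foreach (DataRow row in sheet.Rows)
        {
            Console.WriteLine(row[0]);                           // first cell
            Console.WriteLine(string.Join(", ", row.ItemArray)); // every cell in the row
        }
    }
}
```

Declaring the loop variable as DataRow (instead of var) also removes the need for the cast, because DataRowCollection only implements the non-generic IEnumerable.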

You need to get the Worksheet for the sheet you want to read data from. To get range A1 from Cars, for example:
var app = new Application();
Workbooks workbooks = app.Workbooks;
Workbook workbook = workbooks.Open(@"C:\MSFT Site Account Updates_May 2015.xlsx");
Worksheet sheet = workbook.Sheets["Cars"];
Range range = sheet.Range["A1"];

This is a late reply, but I hope it will help someone.
The code below retrieves the data from the first sheet and also collects the headers from its first row:
if (upload != null && upload.ContentLength > 0)
{
// ExcelDataReader works with the binary Excel file, so it needs a FileStream
// to get started. This is how we avoid dependencies on ACE or Interop:
Stream stream = upload.InputStream;
// We return the interface, so that
IExcelDataReader reader = null;
if (upload.FileName.EndsWith(".xls"))
{
reader = ExcelReaderFactory.CreateBinaryReader(stream);
}
else if (upload.FileName.EndsWith(".xlsx"))
{
reader = ExcelReaderFactory.CreateOpenXmlReader(stream);
}
else
{
ModelState.AddModelError("File", "This file format is not supported");
return View();
}
var result = reader.AsDataSet(new ExcelDataSetConfiguration()
{
ConfigureDataTable = (_) => new ExcelDataTableConfiguration()
{
UseHeaderRow = true
}
}).Tables[0];// get the first sheet data with index 0
var tables = result.Columns; // with UseHeaderRow = true, the column names come from the first row of the sheet
foreach (var rue in tables) // iterate through the header list and add each name to 'Headers'
{
Headers.Add(rue.ToString()); // Headers is a class-level field: private List<string> Headers = new List<string>();
}
var count = Headers.Count; // check that the headers have been added
reader.Close();
return View(result);
}
else
{
ModelState.AddModelError("File", "Please Upload Your file");
}

Related

Open XML Removal of MS Word Table Rows Corrupting Images

I am trying to remove some rows from a table in an MS Word document. Below is what the table looks like before processing:
I analyzed this table to understand its Open XML representation. This is how the InnerText property comes out:
Items
Description
null
Classroom
empty
Interactive Classroom...
empty
empty
Case Study Classrooms ...
empty
empty
Auditoria Lecture Classrooms ...
Computers
empty
Mainframe Computer...
empty
empty
Supercomputer...
empty
empty
Workstation Computer...
The middle empty column is where the image is inserted. Image and the description are in two different cells, having an invisible border in between them.
Below is the code to remove items "Case Study Classrooms", "Supercomputer", "Workstation Computer","Personal Computer" and "Tablet".
var itemsToBeExcluded = new List<string>{"Case Study Classrooms", "Supercomputer", "Workstation Computer","Personal Computer","Tablet"};
using (MemoryStream stream = new MemoryStream())
{
//pageData is a byte[] to represent the word file
stream.Write(pageData, 0, (int)pageData.Length);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
{
var table = wordDoc.MainDocumentPart.Document.Body.OfType<Table>().FirstOrDefault();
int rowCount = 0;
string firstColumnInnerXml = string.Empty;
for (int t = 0; t<table.ChildElements.Count; t++)
{
if(table.ChildElements[t] is TableRow)
{
// Skip the header
if (rowCount++ != 0)
{
// Gets the inner xml of first column of the table and set if it is null for the subsequent rows
if (table.ChildElements[t].ChildElements[1].InnerText.Length > 0)
{
firstColumnInnerXml = table.ChildElements[t].ChildElements[1].InnerXml;
}
else
{
table.ChildElements[t].ChildElements[1].InnerXml = firstColumnInnerXml;
}
foreach (var removableItem in itemsToBeExcluded)
{
if (table.ChildElements[t].ChildElements[3].InnerText.ToLower().StartsWith(removableItem.ToLower()))
{
table.ChildElements[t].Remove();
t--;
goto OUTERCONTINUE;
}
}
OUTERCONTINUE:;
}
}
}
wordDoc.MainDocumentPart.Document.Save();
wordDoc.Close();
}
}
However after execution, the below is what I am getting:
It is obvious that images are missing: even though I am only removing the necessary rows, the images in the unrelated rows also seem to be corrupted or removed. Can someone explain why this happens and how to solve it?

Just need to return distinct values from Excel using EPPlus

All,
I've been trying to figure this out for a day now and did a lot of googling!
I have an Excel file with 5 columns, and the first column holds product numbers. I want to return the DISTINCT product numbers from it. I'm using EPPlus to read the file. Here is my code:
string fileName = file.FileName;
string fileContentType = file.ContentType;
byte[] fileBytes = new byte[file.ContentLength];
var data = file.InputStream.Read(fileBytes, 0, Convert.ToInt32(file.ContentLength));
if (!file.FileName.EndsWith(".xlsx", StringComparison.OrdinalIgnoreCase))
{
throw new Exception("Please ensure that the file has been converted to latest excel version. The file type must be .xlsx.");
}
using (var package = new ExcelPackage(file.InputStream))
{
var currentSheet = package.Workbook.Worksheets;
var workSheet = currentSheet.FirstOrDefault();
var noOfCol = workSheet.Dimension.End.Column;
var noOfRow = workSheet.Dimension.End.Row;
//lets remove all records
//get a list of distinct item numbers and remove all records in preparation for upload
//I need help with this statement!
var result = workSheet.Cells.Select(grp => grp.First()).Distinct().ToList();
So I was able to figure it out by debugging. This doesn't seem to be the most efficient answer, but here it goes:
var result = workSheet.Cells.Where(s => s.Address.Contains("A")).Where(v => v.Value != null).Where(vb => vb.Value.ToString() != "").GroupBy(g => g.Value.ToString()).Distinct().ToList();
So basically: return only column A (the first column, since the address holds this information), then eliminate nulls and blanks, next group by the value, and finally return the distinct values as a list.
Regarding your answer (sorry not enough rep to comment):
workSheet.Cells.Where(s => s.Address.Contains("A")).....
That could include ZA, AA, etc. If you just want column A, you could do:
workSheet.Cells[1,1,workSheet.Dimension.End.Row, 1].....
This will start at A1 and just look down column A to the last row. You might still need to filter out nulls, blanks, etc. If you need to start at row 5, this is all I needed. Example:
workSheet.Cells[5,1,workSheet.Dimension.End.Row, 1].GroupBy(g => g.Value.ToString()).Distinct().ToList();
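As a rough sketch of the null and blank filtering mentioned above, the helper below works on plain cell values, so it can be tried without EPPlus. DistinctNonBlank is a hypothetical name, and trimming whitespace is an extra assumption on my part that may not suit every sheet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctValues
{
    // Returns the distinct, non-blank string forms of the given cell values,
    // in order of first appearance. (Hypothetical helper, not part of EPPlus.)
    public static List<string> DistinctNonBlank(IEnumerable<object> values)
    {
        return values
            .Where(v => v != null)             // skip empty cells
            .Select(v => v.ToString().Trim())  // normalize to trimmed text
            .Where(s => s.Length > 0)          // skip blank strings
            .Distinct()
            .ToList();
    }
}
```

With EPPlus it could presumably be called as DistinctValues.DistinctNonBlank(workSheet.Cells[1, 1, workSheet.Dimension.End.Row, 1].Select(c => c.Value)). Plain Distinct() is enough here, so the GroupBy step is not needed.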

Using Open XML to read an Excel spreadsheet, how do I determine the sheet that a Table is on?

If I have a loaded SpreadsheetDocument instance:
SpreadsheetDocument spreadsheetDocument
and iterate over the WorksheetParts:
foreach (var wp in spreadsheetDocument.WorkbookPart.WorksheetParts)
for every part that is a "Table" I can get to the table definition with:
wp.TableDefinitionParts
and grab the first entry. At this point I can grab the table name:
var tableName = tableDefinitionPart.Table.Name;
But how do I determine which sheet this table is located in?
Given a WorksheetPart (as assigned to wp in your code), the first entry in its Parts list will be a Packaging.IdPartPair object:
var parts = wp.Parts.ToList();
var idPartPair = parts[0];
If you take a look at the value of
idPartPair.OpenXmlPart.Uri.OriginalString
it will be a string that looks like this:
/xl/tables/table2.xml
The only thing you care about is the number 2 in that string. Believe it or not, that's actually saying that the table is in the third sheet of the workbook (zero-based).
At this point, write your favorite code to extract the 2 out of the above string. My version is this, but I'm sure someone else can make it shorter:
var sheetNo = int.Parse(string.Concat(Path.GetFileNameWithoutExtension(idPartPair.OpenXmlPart.Uri.OriginalString).Skip(5)));
Next, get the list of sheets:
var sheets = spreadsheetDocument.WorkbookPart.Workbook.Sheets.ToList();
Then use sheetNo to index into it:
var sheet = (Sheet)sheets[sheetNo];
Then you can easily get the sheet name:
var sheetName = sheet.Name;
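An alternative sketch, in case the number in the table part's file name does not line up with the sheet position in your workbook (as far as I know, the package format does not guarantee that it does). It matches each WorksheetPart to its Sheet element through the relationship id instead. It assumes the DocumentFormat.OpenXml package is referenced. TableSheetLookup and PrintTableSheets are hypothetical names.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public static class TableSheetLookup
{
    // Prints "<table name> is on sheet <sheet name>" for every table in the workbook.
    public static void PrintTableSheets(string path)
    {
        using (SpreadsheetDocument doc = SpreadsheetDocument.Open(path, false))
        {
            WorkbookPart workbookPart = doc.WorkbookPart;
            foreach (WorksheetPart wp in workbookPart.WorksheetParts)
            {
                // The relationship id that links the workbook to this worksheet part.
                string relId = workbookPart.GetIdOfPart(wp);

                // The <sheet> element in workbook.xml carrying the same r:id.
                Sheet sheet = workbookPart.Workbook
                    .Descendants<Sheet>()
                    .FirstOrDefault(s => s.Id != null && s.Id.Value == relId);

                foreach (TableDefinitionPart tdp in wp.TableDefinitionParts)
                {
                    Console.WriteLine("{0} is on sheet {1}",
                        tdp.Table.Name?.Value, sheet?.Name?.Value);
                }
            }
        }
    }
}
```

This avoids parsing the part URI, at the cost of one lookup over the sheet list per worksheet.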

Reading very large excel file

I am using this article to read a very large excel file, using SAX approach.
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
I can't store the values in a DataTable or in memory because the client machine doesn't have enough memory, so I'm trying to read the values and store them in a database right away:
// The SAX approach.
static void ReadExcelFileSAX(string fileName)
{
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
string text;
while (reader.Read())
{
if (reader.ElementType == typeof(CellValue))
{
text = reader.GetText();
Console.Write(text + " ");
}
}
Console.WriteLine();
Console.ReadKey();
}
}
For example when I read this excel file:
Test 1
22
345345
345345435
2333
333333
4444
4444444
324324
99999
I get this output:
Blank
22
Blank
345345
Blank
etc
I have no idea where the blanks are coming from. I tried adding an if statement to test for blanks, but then I miss the last value, 99999.
That reader seems so limited. Would really appreciate a suggestion, I mean anything!
The OpenXmlReader treats the start and end elements as independent items. These can be differentiated by checking the IsStartElement and IsEndElement properties.
Your blank values are due to the end elements where GetText returns the empty string.
You have 2 options to fix it. Firstly you could check for IsStartElement in your loop:
while (reader.Read())
{
if (reader.ElementType == typeof(CellValue)
&& reader.IsStartElement)
{
text = reader.GetText();
Console.WriteLine(text + " ");
}
}
Alternatively you can use the LoadCurrentElement method to load the whole element, consuming both the start and end you were getting before:
while (reader.Read())
{
if (reader.ElementType == typeof(CellValue))
{
CellValue cellVal = (CellValue)reader.LoadCurrentElement();
Console.WriteLine(cellVal.Text);
}
}

parse excel file best practise

I am facing an issue parsing an Excel file. My file has more than 5000 rows, and parsing it takes ages. Is there a better way to do this?
public static List<List<List<string>>> ExtractData(string filePath)
{
List<List<List<string>>> Allwork = new List<List<List<string>>>();
Microsoft.Office.Interop.Excel.Application excelApp = new Microsoft.Office.Interop.Excel.Application();
Microsoft.Office.Interop.Excel.Workbook workBook = excelApp.Workbooks.Open(filePath);
foreach (Microsoft.Office.Interop.Excel.Worksheet sheet in workBook.Worksheets)
{
List<List<string>> Sheet = new List<List<string>>();
Microsoft.Office.Interop.Excel.Range usedRange = sheet.UsedRange;
//Iterate the rows in the used range
foreach (Microsoft.Office.Interop.Excel.Range row in usedRange.Rows)
{
List<string> Rows = new List<string>();
String[] Data = new String[row.Columns.Count];
for (int i = 0; i < row.Columns.Count; i++)
{
try
{
Data[i] = row.Cells[1, i + 1].Value2.ToString();
Rows.Add(row.Cells[1, i + 1].Value2.ToString());
}
catch
{
Rows.Add(" ");
}
}
Sheet.Add(Rows);
}
Allwork.Add(Sheet);
}
excelApp.Quit();
return Allwork;
}
This is my code.
Your issue is that you are reading one cell at a time, which is very costly and inefficient. Try reading a range of cells instead.
A simple example is below:
Excel.Range range = worksheet.get_Range("A"+i.ToString(), "J" + i.ToString());
System.Array myvalues = (System.Array)range.Cells.Value;
string[] strArray = ConvertToStringArray(myvalues);
A link to basic example
Read all the cell values from a given range in excel
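ConvertToStringArray is not shown above, so here is a minimal sketch of the conversion step, assuming the whole used range is fetched in one call, e.g. object[,] values = (object[,])sheet.UsedRange.Value2;. RangeConversion and ToRows are hypothetical names. Empty cells become " " to match the question's code.

```csharp
using System;
using System.Collections.Generic;

public static class RangeConversion
{
    // Converts a two-dimensional block of cell values into rows of strings.
    // GetLowerBound/GetUpperBound are used because arrays returned by
    // Excel interop are 1-based rather than 0-based.
    public static List<List<string>> ToRows(object[,] values)
    {
        var rows = new List<List<string>>();
        for (int r = values.GetLowerBound(0); r <= values.GetUpperBound(0); r++)
        {
            var row = new List<string>();
            for (int c = values.GetLowerBound(1); c <= values.GetUpperBound(1); c++)
            {
                object cell = values[r, c];
                row.Add(cell == null ? " " : cell.ToString()); // empty cells come back as null
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

One caveat: when the range is a single cell, Value2 returns the bare value rather than an array, so that case needs separate handling before the cast.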
I suggest not using interop, but an ODBC connection to get the Excel data. This lets you treat the Excel file as a database and use SQL statements to read the data you need.
If that's an option, and if your tables have a simple structure, I would suggest exporting the file to .csv and applying simple string processing logic.
You might also want to try out Igos's suggestion.
One approach is to use something like the ClosedXML library to directly read the .xlsx file, not going through the Excel interop.
