Finding rightmost cell in an Excel document in .Net - c#

I'm reading an Excel document via the DocumentFormat.OpenXml library. Is there a good way to find out how many columns it has?
The current code, which I've just come across while investigating a bug, does this:
public string getMaxColumnName(SheetData aSheetData)
{
    // Reference of the last cell in document order, e.g. "CB99".
    string lLastCellReference = aSheetData.Descendants<Cell>().Last().CellReference.InnerText;
    // IndexOfAny returns an int: the position of the first digit.
    int lRowNumberIndex = lLastCellReference.IndexOfAny(new char[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' });
    return lLastCellReference.Substring(0, lRowNumberIndex);
}
In English: find the last cell in the sheet, get its cell reference (like "CB99"), and retrieve everything before the first digit. The problem is that the last cell in the sheet is not necessarily in the rightmost column.
I have a test sheet that is a neat, rectangular table. It has 1000 rows filling columns A through M, so the function is supposed to return the string "M". But because there is an extraneous space character in cell C1522, that's counted as the last cell, so the function reports the max column as "C".
My initial impulse was to just replace that Last() call with some kind of Max(columnNumber). However, Cell apparently does not expose an actual column number, only this composite CellReference string. I don't think I want to be doing string-splitting inside a predicate there.
Is there a way to find the sheet's rightmost column, without having to parse the CellReference of every single cell?

As I understand the format, there are various cases:
If the file was not generated by Excel, and the worksheet has no blank rows and no blank columns within a row, but not necessarily the same number of columns in every row (which may be your case):
You are out of luck. The format allows row and cell references to be omitted in this case, so you have to count the cells in each row and take the maximum.
If the file was not generated by Excel, but the cells are sparsely populated (which apparently is not your case):
The last cell of each row carries its cell reference in the "r" attribute, and the column can be read from that. You will have to convert the reference to a column number, though.
If the file is generated by Excel:
Usually (I haven't found an Excel-generated file that doesn't), the worksheet part has a child element named dimension, whose "ref" attribute holds the range of cells used by the worksheet, e.g. "A1:M1001". Reading the columns from it is then straightforward. Of course, this only works if the extraneous character is not in a column to the right of the table.
Alternatively, every row usually has an attribute called "spans" (every Excel-generated file I have seen has it) that lists the columns the row uses. The "spans" format is numeric, so in your example it would have the value "1:13" for every row in the table. Maybe you only have to check the first row this way.
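Whichever of these cases applies, the cell reference still has to be converted into a column. Here is a minimal sketch of that conversion using plain string handling only; the class and method names (ColumnRefs, ColumnNameOf and so on) are made up for illustration and are not part of DocumentFormat.OpenXml. It assumes references look like "CB99" and that a dimension "ref" looks like "A1:M1001" or a single cell such as "A1".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class ColumnRefs
{
    // Returns the letter prefix of a cell reference: "CB99" -> "CB".
    public static string ColumnNameOf(string cellReference)
    {
        int i = 0;
        while (i < cellReference.Length && char.IsLetter(cellReference[i])) i++;
        return cellReference.Substring(0, i);
    }

    // Converts column letters to a 1-based number: "A" -> 1, "M" -> 13, "CB" -> 80.
    public static int ColumnNameToNumber(string columnName)
    {
        int number = 0;
        foreach (char c in columnName.ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z')
                throw new ArgumentException("Not a column name: " + columnName);
            number = number * 26 + (c - 'A' + 1);
        }
        return number;
    }

    // Last column of a dimension "ref" value: "A1:M1001" -> "M", "A1" -> "A".
    public static string LastColumnOfDimension(string dimensionRef)
    {
        int colon = dimensionRef.IndexOf(':');
        string lastCell = colon >= 0 ? dimensionRef.Substring(colon + 1) : dimensionRef;
        return ColumnNameOf(lastCell);
    }

    // Largest 1-based column number among a set of cell references.
    public static int MaxColumnNumber(IEnumerable<string> cellReferences)
    {
        return cellReferences.Max(r => ColumnNameToNumber(ColumnNameOf(r)));
    }
}
```

With these helpers, the "every cell" fallback becomes a single MaxColumnNumber call over the cell references, and the dimension shortcut is LastColumnOfDimension applied to the "ref" value.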

I have concluded that this is the wrong thing to do in the first place. The consuming code is never actually looking for the rightmost cell in the whole sheet. Generally, what it wants is the number of cells in a particular row: either row 1 or a known table header location.
In fact, with the possible exception of rendering or printing, I can't come up with any situation where getting the whole sheet's max cell is necessary.
Therefore, I need to refactor slightly. I'm changing the function so it takes a sheet and a row index, and returns the column of the rightmost cell in that row. That is, it will now look like:
public string getMaxColumnIndex(SheetData aSheetData, int aRowIndex);
For the implementation of that, I can check the Row.Spans property when it exists, or else parse the cell reference of Row.ChildElements.Last().
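A rough sketch of that decision logic follows, written against plain strings so it does not depend on the SDK types. The assumptions are mine: the caller passes the row's "spans" text (null or empty when the attribute is missing, possibly several space-separated spans such as "1:3 5:13") and the reference of the row's last cell. The names RowWidth and MaxColumnOfRow are hypothetical, and the sketch returns a 1-based column number rather than a column name.

```csharp
using System;
using System.Linq;

// Hypothetical helper; the inputs would come from the row's spans and last cell.
public static class RowWidth
{
    public static int MaxColumnOfRow(string spans, string lastCellReference)
    {
        if (!string.IsNullOrWhiteSpace(spans))
        {
            // Each span is "first:last"; take the largest "last" value.
            return spans
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(span => int.Parse(span.Substring(span.IndexOf(':') + 1)))
                .Max();
        }

        // Fallback: convert the letter prefix of the last cell reference.
        int number = 0;
        foreach (char c in lastCellReference)
        {
            if (!char.IsLetter(c)) break;
            number = number * 26 + (char.ToUpperInvariant(c) - 'A' + 1);
        }
        return number;
    }
}
```

For the test sheet described in the question, a table row with spans "1:13" gives 13, while the stray row containing only C1522 gives 3 through the fallback.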

Related

How to find the maximum row and column WITHOUT reading all data?

Using C# .Net Google Sheets API.
I am new to the API, so I may have missed it in the docs - but how do you find out the maximum row and column that contain a value without reading all the data in the sheet?
For example, if a sheet contains multiple values and the "last" cell in the sheet with a value is at C139 (no cells in the rows following have a value and no cells in any column after C have a value), then the maximum row would be 139 and the maximum column would be 2 (zero based) or 3 (one based).
I tried sheet.Properties.GridProperties.RowCount -- but that gives the TOTAL number of rows in the sheet (whether the cells have values or not).
Same goes for sheet.Properties.GridProperties.ColumnCount -- gives the TOTAL number of columns in the sheet (whether the cells have values or not).
Any links or ideas are welcome.
I understand that you want to know the last row of data in your Sheet. In that case, you can use a simple GET with an open-ended range. For example, if your Sheet only has two columns, you can set the range to A1:B. That range covers both columns in full, but the GET only returns values as far down as the data goes. At that point you have an array holding your data range, so its length gives you the last row number (when the range starts at row 1). If you don't know how many columns your Sheet has, widen the range in the same way (e.g. A1:Z). Feel free to ask if anything about this approach is unclear.
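The counting step can be sketched as below. It assumes the values have already been fetched with a range starting at A1 and arrive as a list of rows, each row being a list of cell objects (the shape the .NET client exposes on a value range), and that trailing empty rows and columns are left out of the response. SheetExtent and Measure are illustrative names, not API members.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper operating on already-fetched values.
public static class SheetExtent
{
    // Returns one-based (MaxRow, MaxColumn); (0, 0) when there is no data.
    public static (int MaxRow, int MaxColumn) Measure(IList<IList<object>> values)
    {
        if (values == null || values.Count == 0) return (0, 0);

        // Rows are returned only up to the last row that holds data.
        int maxRow = values.Count;

        // Each row is only as long as its last populated cell, so take the widest.
        int maxColumn = values.Max(row => row == null ? 0 : row.Count);

        return (maxRow, maxColumn);
    }
}
```

Note that this still downloads every value inside the requested range; it only avoids having to know in advance where the data ends.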

ExcelDataReader in C# - How to reference an individual Cell using row and column coordinates

I'm reading an .xlsx spreadsheet into a C# console app with a view to outputting the content as a formatted xml file (to be picked up by another part of the system further down the line).
The problem with the .xlsx file is that it's a pro-forma input document based on, and replacing, an old paper-based order form we used to provide to customers, and the input fields aren't organised as a series of similar rows (except in the lower part of the document which consists of up to 99 rows of order detail lines). Some of the rows in the header part of the form/sheet are a mixture of label text AND data; same with the columns.
Effectively, what I need to do is to be able to cherry pick data from the initial dozen or so rows in order to poke data into the xml structure; the latter part of the document I can process by iterating over the rows for the order detail lines.
I can't use Interop as this will end up as an Azure function - so I've used ExcelDataReader to convert the spreadsheet to a dataset, then convert that dataset to a new dataset entirely composed of string values. But I haven't been able to successfully point to individual cells as I had expected to be using syntax something like
var cellValue = MyDataSet.Cell[10, 2];
I'd be grateful for any advice as to how I might get the result I need.
A Dataset has Tables and those have Rows which hold ColumnValues
A WorkSheet transforms into a Table (with Columns) and the Cells transform to Rows and column values.
To find the cell value at [10,2] on the first Worksheet do:
var cellValue = MyDataSet.Tables[0].Rows[10][2];
Remember that cellValue will be of type object. Cast accordingly.
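As a small sketch of that lookup with the cast folded in: the helper below uses only System.Data, and both indexes are zero-based, so [10][2] is the eleventh row and third column of the table. CellLookup and GetCellText are hypothetical names, and mapping empty cells (DBNull) to null is my own choice.

```csharp
using System;
using System.Data;

// Hypothetical helper around the DataSet indexers.
public static class CellLookup
{
    // Returns the cell as text, or null when the cell is empty (DBNull).
    public static string GetCellText(DataSet dataSet, int tableIndex, int rowIndex, int columnIndex)
    {
        object value = dataSet.Tables[tableIndex].Rows[rowIndex][columnIndex];
        return value == null || value == DBNull.Value ? null : Convert.ToString(value);
    }
}
```

A call such as CellLookup.GetCellText(MyDataSet, 0, 10, 2) would then stand in for the MyDataSet.Cell[10, 2] syntax the question was hoping for.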

Cannot detect merged cells through Range.MergeCells Property for Excel

When I send data to Excel it ignores the merged "property" of some cells and just writes to the first cell it finds. So assuming I have column A and column B merged and I am sending data to column A and C, it actually splits the merged column so I am left with an empty column B.
Here is some code for context (some variables have been kept generic):
Range cells = this.Worksheet.Cells;
Range cell = (Range)cells[rowIndex, columnIndex];
Boolean merged = (Boolean)cell.MergeCells; //Here I am trying to determine if the
//cell is merged.
My problem is that .MergeCells always returns false. What am I doing wrong here? I know that in the Excel worksheet the cells are merged.
The problem is you are casting to a boolean, and MergeCells is not always guaranteed to give you back a boolean, as outlined in this more recent question: how to detect merged cells in c# using MS interop excel. You need to also check for the value of null - see the linked question for how to do that.
Hypothesis
So what's probably happening to your code is the null value casts back to false, even though what the null value actually indicates is that there are merged cells in the range.
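One defensive way to read the property is to inspect the returned object instead of casting it directly. The sketch below is not interop code itself: it takes whatever object MergeCells handed back and assumes that an indeterminate result reaches C# as either null or DBNull.Value. MergeState and Interpret are made-up names.

```csharp
using System;

// Hypothetical helper; pass it the raw object returned by the MergeCells property.
public static class MergeState
{
    // true  = the range is merged
    // false = the range is not merged
    // null  = no definite answer (the value was null, DBNull, or not a bool)
    public static bool? Interpret(object mergeCellsValue)
    {
        if (mergeCellsValue is bool merged) return merged;
        return null;
    }
}
```

The caller can then treat a null result as "contains some merged cells" or as "unknown", whichever fits, rather than letting it collapse to false.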
The answer is: Your code is correct.
Boolean merged = (Boolean)cell.MergeCells; //Cast from dynamic{bool} to bool
This works for me (Excel 2013 on Windows 7).
I have noticed both true and false values in my own tests.
So maybe your worksheet's cells just DO NOT CONTAIN a merged cell!?

Using Excel interop to retrieve information from the range that meet criteria

oSheet = (Excel._Worksheet)xlWorkBook.ActiveSheet;
oRng = oSheet.get_Range("T10", "T343");
The range oRng contains values of type double. Each cell in column T shows the max number of the associated row. How can I find out how many 1s, 2s, 3s and so on up to 10 are in that range? Secondly, if there are, for example, 20 rows with value 3, I need to copy columns A, B and C from those rows and store them for later use. I need the count of rows for each value from 1 to 10.
Here are a few general observations that might be enough to get you going:
Excel.Range has an AutoFilter method that you might be able to employ successively for each value that you're interested in (i.e., 1 through 10). Once you have the individual ranges returned by AutoFilter, you can then query them for the specific information you're interested in. See C# Excel Automation: Retrieving rows after AutoFilter() with SpecialCells() does not seem to work properly for issues associated with this approach.
Alternatively, you might be able to create a simple dictionary that you populate as you iterate over column T. For example, the dictionary could be of type Dictionary<double, List<int>>.
As you proceed through column T, you encounter a value in each cell. If the cell value hasn't been seen before, you add it as a new key to the Dictionary. For the associated value in the dictionary's key/value pair, you create a new List with the corresponding row number as its first element.
If the cell value has been seen before, you look it up in the dictionary, then add the corresponding row to the List associated with that key.
At the end of the day, your dictionary's keys contain all the values found in column T. The number of rows associated with each value is just the number of elements in the associated List. With the row values in the List, you can then find "A[row value]", "B[row value]" and "C[row value]".
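The dictionary approach can be sketched as follows. It assumes the values of column T have already been read into a list of doubles and that the first value sits on a known sheet row (10 for the range T10:T343). The names ValueIndex and GroupRowsByValue are illustrative only.

```csharp
using System.Collections.Generic;

// Hypothetical helper; columnValues holds the cell values of column T in order.
public static class ValueIndex
{
    // Maps each distinct value to the sheet row numbers where it occurs.
    public static Dictionary<double, List<int>> GroupRowsByValue(
        IReadOnlyList<double> columnValues, int firstRow)
    {
        var rowsByValue = new Dictionary<double, List<int>>();
        for (int i = 0; i < columnValues.Count; i++)
        {
            double value = columnValues[i];
            if (!rowsByValue.TryGetValue(value, out List<int> rows))
            {
                // First time this value is seen: start a new row list.
                rows = new List<int>();
                rowsByValue[value] = rows;
            }
            rows.Add(firstRow + i);
        }
        return rowsByValue;
    }
}
```

After the call, rowsByValue[3].Count is the number of rows whose value is 3, and each entry in that list is a row number to use when reading columns A, B and C. Values that never occur have no key, so check with TryGetValue or ContainsKey first.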

The right data structure to use for an Excel clone

Let say I'm working on an Excel clone in C#.
My grid is represented as follows:
private struct CellValue
{
private int column;
private int row;
private string text;
}
private List<CellValue> cellValues = new List<CellValue>();
Each time the user adds some text, I just package it as a CellValue and add it to cellValues. Given a CellValue, I can determine its row and column in O(1) time, which is great. However, given a column and a row, I need to loop through the entire cellValues list to find which text is in that column and row, which is terribly slow. Also, given a text, I need to loop through the entire list too. Is there any data structure with which I can achieve all 3 tasks in O(1) time?
Updated:
Looking through the answers, I haven't found one that I like. Can I also have the following?
No more than 2 copies of each CellValue kept, to avoid having to sync them. In C I would have made good use of pointers.
Rows and columns that can be added dynamically (unlike Excel).
I would opt for a sparse array (a linked list of linked lists) to give maximum flexibility with minimum storage.
In this example, you have a linked list of rows with each element pointing to a linked list of cells in that row (you could reverse the cells and rows depending on your needs).
 |
 V
+-+    +---+             +---+
|1| -> |1.1| ----------> |1.3| -:
+-+    +---+             +---+
 |
 V
+-+             +---+
|7| ----------> |7.2| -:
+-+             +---+
 |
 =
Each row element has the row number in it and each cell element has a pointer to its row element, so that getting the row number from a cell is O(1).
Similarly, each cell element has its column number, making that O(1) as well.
There's no easy way to get O(1) for finding immediately the cell at a given row/column but a sparse array is as fast as it's going to get unless you pre-allocate information for every possible cell so that you can do index lookups on an array - this would be very wasteful in terms of storage.
One thing you could do is make one dimension non-sparse, such as making the columns the primary array (rather than linked list) and limiting them to 1,000 - this would make the column lookup indexed (fast), then a search on the sparse rows.
I don't think you can ever get O(1) for a text lookup simply because text can be duplicated in multiple cells (unlike row/column). I still believe the sparse array will be the fastest way to search for text, unless you maintain a sorted index of all text values in another array (again, that can make it faster but at the expense of copious amounts of memory).
I think you should use one of the indexed collections to make it work reasonably fast; the ideal one is KeyedCollection.
You need to create your own collection by extending this class. This way your object will still contain the row and column (so you will not lose anything), but you will be able to search by them. You will probably have to create a class encapsulating (row, column) and make it the key (so make it immutable and override Equals and GetHashCode).
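A minimal sketch of such a collection is shown below. Instead of a hand-written key class it uses a value tuple (row, column) as the key, which already provides value equality and hashing; that substitution is mine. CellValue is redefined here as a small class with public read-only Row and Column so the example is self-contained, and CellCollection and Find are hypothetical names.

```csharp
using System.Collections.ObjectModel;

// Illustrative cell type; Row and Column are immutable because they form the key.
public sealed class CellValue
{
    public CellValue(int row, int column, string text)
    {
        Row = row;
        Column = column;
        Text = text;
    }

    public int Row { get; }
    public int Column { get; }
    public string Text { get; set; }
}

// Keyed by (row, column); lookups by key go through the internal dictionary.
public sealed class CellCollection : KeyedCollection<(int, int), CellValue>
{
    protected override (int, int) GetKeyForItem(CellValue item)
    {
        return (item.Row, item.Column);
    }

    // Returns the cell at the given position, or null when the position is empty.
    public CellValue Find(int row, int column)
    {
        var key = (row, column);
        return Contains(key) ? this[key] : null;
    }
}
```

Each cell is stored once, so there is nothing to keep in sync for row/column lookups. Looking a cell up by its text still requires a scan or a separate index.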
I'd create
Collection<Collection<CellValue>> rowCellValues = new Collection<Collection<CellValue>>();
and
Collection<Collection<CellValue>> columnCellValues = new Collection<Collection<CellValue>>();
The outer collection has one entry for each row or column, indexed by the row or column number, the inner collection has all the cells in that row or column. These collections should be populated as part of the process that creates new CellValue objects.
rowCellValues[newCellValue.Row].Add(newCellValue);
columnCellValues[newCellValue.Column].Add(newCellValue);
This smells of premature optimization.
That said, there are a few features of Excel that are important in choosing a good structure.
First, Excel uses the cells in a moderately non-linear fashion. Resolving formulas involves traversing the spreadsheet in effectively random order. The structure needs a cheap way to look up values by arbitrary key and to mark them dirty, resolved, or unresolvable due to a circular reference. It also needs some way to know when no unresolved cells are left, so that it can stop working. Any solution that involves a linked list is probably sub-optimal for this, since it would require a linear scan to reach those cells.
Another issue is that Excel displays a range of cells at one time. This may seem trivial, and to a large extent it is, but it would be ideal if the app could pull all of the data needed to draw a range of cells in one shot. Part of this may be keeping track of the display height and width of the rows and columns, so that the display system can iterate over the range until the desired width and height of cells has been collected. The need to iterate in this manner may preclude the use of a hashing strategy for sparse storage of cells.
On top of that, there are some weaknesses of the representational model of spreadsheets that could be addressed much more effectively by taking a slightly different approach.
For example, column aggregates are sort of clunky. A column total is easy enough to implement in Excel, but it has a sort of magic behavior that works most of the time but not all of the time. For instance, if you add a row into the aggregated area, further calculations on that aggregate may continue to work, or not, depending on how you added it. If you copy and insert a row (and replace the values) everything works fine, but if you cut and paste the cells one row down, things don't work out so well.
Given that the data is 2-dimensional, I would have a 2D array to hold it in.
Well, you could store them in three Dictionaries: two Dictionary<int,CellValue> objects for rows and columns, and one Dictionary<string,CellValue> for text. You'd have to keep all three carefully in sync though.
I'm not sure that I wouldn't just go with a big two-dimensional array though...
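To illustrate the "keep them in sync" part, here is a sketch that deviates from the three dictionaries described above: it uses one dictionary keyed by (row, column) and one reverse index from text to the set of positions holding that text, since the same text can appear in several cells. All writes go through a single Set method so the two structures cannot drift apart. SheetIndex and its members are made-up names, and text is assumed to be non-null.

```csharp
using System.Collections.Generic;

// Hypothetical two-way index: position -> text and text -> positions.
public sealed class SheetIndex
{
    private readonly Dictionary<(int, int), string> textByPosition =
        new Dictionary<(int, int), string>();
    private readonly Dictionary<string, HashSet<(int, int)>> positionsByText =
        new Dictionary<string, HashSet<(int, int)>>();

    // Writes a cell and updates both indexes together.
    public void Set(int row, int column, string text)
    {
        var position = (row, column);

        // Remove the position from the entry for its previous text, if any.
        if (textByPosition.TryGetValue(position, out string oldText))
        {
            HashSet<(int, int)> oldSet = positionsByText[oldText];
            oldSet.Remove(position);
            if (oldSet.Count == 0) positionsByText.Remove(oldText);
        }

        textByPosition[position] = text;

        if (!positionsByText.TryGetValue(text, out HashSet<(int, int)> set))
        {
            set = new HashSet<(int, int)>();
            positionsByText[text] = set;
        }
        set.Add(position);
    }

    // Average O(1) lookup by position; null when the cell is empty.
    public string GetText(int row, int column)
    {
        return textByPosition.TryGetValue((row, column), out string text) ? text : null;
    }

    // Average O(1) lookup by text; the number of cells holding that text.
    public int CountCellsWithText(string text)
    {
        return positionsByText.TryGetValue(text, out HashSet<(int, int)> set) ? set.Count : 0;
    }
}
```

The cost is roughly double the bookkeeping per write and extra memory for the reverse index, which is the trade-off against a plain two-dimensional array.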
If it's an exact clone, then an array-backed list of CellValue[256] arrays. Excel 2003 and earlier have 256 columns (Excel 2007 and later allow 16,384), but a growable number of rows.
If rows and columns can be added "dynamically", then you shouldn't store the row/column as an numeric attribute of the cell, but rather as a reference to a row or column object.
Example:
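// A class rather than a struct: the row and column lists must refer to the
// same cell object, and a struct would be copied on every Add and Remove.
private class CellValue
{
    private List<CellValue> _column;
    private List<CellValue> _row;
    private string text;

    public List<CellValue> column
    {
        get { return _column; }
        set
        {
            // Detach from the old column, then attach to the new one.
            if (_column != null) { _column.Remove(this); }
            _column = value;
            if (_column != null) { _column.Add(this); }
        }
    }

    public List<CellValue> row
    {
        get { return _row; }
        set
        {
            // Detach from the old row, then attach to the new one.
            if (_row != null) { _row.Remove(this); }
            _row = value;
            if (_row != null) { _row.Add(this); }
        }
    }
}

private List<List<CellValue>> MyRows = new List<List<CellValue>>();
private List<List<CellValue>> MyColumns = new List<List<CellValue>>();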
Each Row and Column object is implemented as a List of the CellValue objects. These are unordered--the order of the cells in a particular Row does not correspond to the Column index, and vice-versa.
Each sheet has a List of Rows and a list of Columns, in order of the sheet (shown above as MyRows and MyColumns).
This will allow you to rearrange and insert new rows and columns without looping through and updating any cells.
Deleting a row should loop through the cells on the row and delete them from their respective columns before deleting the row itself. And vice-versa for columns.
To find a particular Row and Column, find the appropriate Row and Column objects, then find the CellValue that they contain in common.
Example:
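// Requires "using System.Linq;" for Intersect and First.
public CellValue GetCell(int rowIndex, int colIndex) {
    List<CellValue> row = MyRows[rowIndex];
    List<CellValue> col = MyColumns[colIndex];
    // Intersect returns an IEnumerable, which has no indexer, so take the first match.
    return row.Intersect(col).First();
}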
(I'm a little fuzzy on these Extension methods in .NET 3.5, but this should be in the ballpark.)
If I recall correctly, there was an article about how Visicalc did it, maybe in Byte Magazine in the early 80s. I believe it was a sparse array of some sort. But I think there were links both up-and-down and left-and-right, so that any given cell had a pointer to the cell above it (however many cells away that may be), below it, to the left of it, and to the right of it.
