Excel Interop Open/Repair HResult exception - c#

What I do: populate & format an Excel file using a mix of Interop and ClosedXML.
First, the file is populated via Interop, then saved, closed, then I format the cells' RichText using ClosedXML.
Unfortunately, this formatting causes Excel to treat my file as "corrupt" and insist on repairing it.
This is the relevant part:
var workbook = new XLWorkbook(xlsPath);
var sheet = workbook.Worksheet("Error Log");
for (var rownum = 2; rownum <= 10000; rownum++)
{
var oldcell = sheet.Cell("C" + rownum);
var newcell = sheet.Cell("D" + rownum);
var oldtext = oldcell.GetFormattedString();
if(string.IsNullOrEmpty(oldtext.Trim()))
break;
XlHelper.ColorCellText(oldcell, "del", System.Drawing.Color.Red);
XlHelper.ColorCellText(newcell, "add", System.Drawing.Color.Green);
}
workbook.Save();
And the colouring method:
public static void ColorCellText(IXLCell cel, string tagName, System.Drawing.Color col)
{
var rex = new Regex("\\<g\\sid\\=[\\sa-z0-9\\.\\:\\=\\\"]+?\\>");
var txt = cel.GetFormattedString();
var mc = rex.Matches(txt);
var xlcol = XLColor.FromColor(col);
foreach (Match m in mc)
{
txt = txt.Replace(m.Value, "");
txt = txt.Replace("</g>", "");
}
var startTag = string.Format("[{0}]", tagName);
var endTag = string.Format("[/{0}]", tagName);
var crt = cel.RichText;
crt.ClearText();
while (txt.Contains(startTag) || txt.Contains(endTag))
{
var pos1 = txt.IndexOf(startTag);
if (pos1 == -1)
pos1 = 0;
var pos2 = txt.IndexOf(endTag);
if (pos2 == -1)
pos2 = txt.Length - 1;
var txtLen = pos2 - pos1 - 5;
crt.AddText(txt.Substring(0, pos1));
crt.AddText(txt.Substring(pos1 + 5, txtLen)).SetFontColor(xlcol);
txt = txt.Substring(pos2 + 6);
}
if (!string.IsNullOrEmpty(txt))
crt.AddText(txt);
}
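One thing worth ruling out in the method above is the hard-coded tag lengths (5 and 6) and the empty runs it can add when a tag sits at the very start of the text. The following is only a minimal sketch, not a confirmed fix for the repair message: it splits the text into plain and tagged runs with lengths derived from the tag name and skips empty runs. `TagSplitter` and `Split` are hypothetical names introduced for illustration; the ClosedXML calls are deliberately left out.

```csharp
using System;
using System.Collections.Generic;

public static class TagSplitter
{
    // Splits "a[del]b[/del]c" into runs, marking which ones were inside [tag]...[/tag].
    public static List<(string Text, bool Tagged)> Split(string txt, string tagName)
    {
        var startTag = "[" + tagName + "]";
        var endTag = "[/" + tagName + "]";
        var result = new List<(string Text, bool Tagged)>();
        var pos = 0;

        while (pos < txt.Length)
        {
            var open = txt.IndexOf(startTag, pos, StringComparison.Ordinal);
            if (open < 0) break;                 // no further tagged run
            var innerStart = open + startTag.Length;
            var close = txt.IndexOf(endTag, innerStart, StringComparison.Ordinal);
            if (close < 0) break;                // unbalanced tag: keep the rest as plain text

            if (open > pos)                      // plain text before the tag, if any
                result.Add((txt.Substring(pos, open - pos), false));
            if (close > innerStart)              // tagged text, skipping empty runs
                result.Add((txt.Substring(innerStart, close - innerStart), true));

            pos = close + endTag.Length;
        }

        if (pos < txt.Length)                    // trailing plain text
            result.Add((txt.Substring(pos), false));
        return result;
    }
}
```

Each returned run could then be passed to `crt.AddText(...)`, calling `SetFontColor(xlcol)` only for the tagged ones, as the original method does.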
Error in file myfile.xlsx
The following repairs were performed:
Repaired records:
string properties of /xl/sharedStrings.xml-Part (strings)
I've been through all the XML parts looking for clues. In the affected sheet, the comparison view of the Productivity Tool shows some blocks as inserted in the repaired file and deleted in the corrupt one, although nothing significant seems to have changed, except for one thing: the style attribute of that cell. Here is an example:
<x:c r="AA2" s="59">
<x:f>
(IFERROR(VLOOKUP(G2,Legende!$A$42:$B$45,2,FALSE),0))
</x:f>
</x:c>
I have checked the styles.xml for style 59, but there is none. In the repaired file, this style has been changed to 14, which in my styles.xml is listed as a number format.
Unfortunately, a global search/replace of these invalid style indexes did not resolve the issue.
Seeing the things going on here with corrupt indexes, renamed xmls, invalid named ranges etc., I took a different route: not to use interop at all, maybe the corruption was caused by Excel in the first place and the coloring was only the last straw.
Using ClosedXml only:
Wow. Just wow. This makes it even worse. I commented out the colouring part since without that, Interop produced a readable file without errors, so that's what I expect of ClosedXml too.
This is how I open the file and address the worksheet with ClosedXml:
var wb= new XLWorkbook(xlsPath);
var errors = wb.Worksheet("Error Log");
This is how I write the values into the file:
errors.Cell(zeile, 1).SetValue(fname);
With zeile being a simple int counter.
I then dare to set a column width:
errors.Column(2).Width = 50;
errors.Column(3).Width = 50;
errors.Column(4).Width = 50;
As well as setting some values in another sheet in exactly the same fashion before saving with validation.
wb.Save(true);
wb.Dispose();
Lo and behold: The validation throws errors:
Attribute 'name' should have unique value. Its current value 'Legende' duplicates with others.
Attribute 'sheetId' should have unique value. Its current value '4' duplicates with others.
A couple more errors like attribute 'top' having invalid value '11.425781'.
Excel cannot open the file directly, must repair it. My Sheet "Legende" is now empty and the first sheet instead of third, and I get an additional fourth sheet "Restored_Table1" which contains my original "Legende" contents.
What the hell is going on with this file??
New attempt: re-create the Excel template from scratch - in LibreOffice.
I now think that the issue is entirely misleading. If I use the newly created file from LibreOffice, the validation causes a System.OutOfMemory exception due to too many validation errors. Opening in Excel requires repair, gives additional sheet and so forth.
Creating in LibreOffice, then opening in Excel, saving, then using that file as template produces a much better result albeit not perfect yet.
Since I copied parts over from the old Excel file into LO while creating the new file, I assume some corrupt remnant got copied over.
I cannot shake the feeling that this is the file itself after all and has nothing to do with how I edit it!
Will post an update tomorrow.

OK. Stuff this.
I created a completely fresh file with LibreOffice, making sure not to copy over anything at all from the original file, and I ditched Interop in favour of ClosedXml.
=> This produced a corrupt file in which my first sheet was cleared and its contents moved to a "Restored_Table1".
After I opened my fresh new template with Excel via Open/Repair and saved it, the resulting, uncoloured file was NOT corrupt.
=> Colouring it produces the "original" corruption, all sheets intact.
ClosedXml seems to be marginally slower than Interop but at this point I couldn't care less. I guess we will have to live with the "corrupt" message and just get on with it.
I hate xlsx.

Related

NPOI issue with FirstRowNum and LastRowNum returning -1 for a sheet that has rows

When trying to get the rows from an Excel sheet using NPOI, FirstRowNum and LastRowNum return -1.
IWorkbook workbook = null;
List<ImportedKPI> excelRows = new List<ImportedKPI>();
MemoryStream ms = new MemoryStream(array);
ISheet sheet = null;
workbook = WorkbookFactory.Create(ms);
sheet = workbook.GetSheet(mapping.Sheet);
//Do some stuff here and try to get rows
for (int i = sheet.FirstRowNum; i <= sheet.LastRowNum; i++)
//sheet.FirstRowNum = -1 && sheet.LastRowNum = -1
Also worth mentioning: when I open the file and close it again (with LibreOffice), it asks whether I want to save, and after saving it works.
When comparing the byte arrays before/after save they are different.
It turns out that not every Excel parser works for every type of file.
Since I receive the Excel files by email, I had to build a chain-of-responsibility pattern on top of the NPOI and ExcelDataReader NuGet packages.
This means that when a file arrives, I try to parse it with the first class (ExcelDataReader), and if that is not successful it moves on to NPOI and tries to parse it with that.
Seems like the better way when there are many sources.
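The fallback chain described above could look roughly like the sketch below. It is only an illustration under assumptions: `ParserChain` and `ParseWithFallback` are hypothetical names, the parsers are passed in as delegates, and the actual ExcelDataReader and NPOI calls would live inside those delegates.

```csharp
using System;
using System.Collections.Generic;

public static class ParserChain
{
    // Tries each parser in order and returns the first non-empty result.
    public static List<string[]> ParseWithFallback(
        byte[] file,
        IEnumerable<Func<byte[], List<string[]>>> parsers)
    {
        var failures = new List<Exception>();
        foreach (var parse in parsers)
        {
            try
            {
                var rows = parse(file);
                if (rows != null && rows.Count > 0)
                    return rows;          // this parser handled the file
            }
            catch (Exception ex)
            {
                failures.Add(ex);         // remember why it failed, then try the next one
            }
        }
        throw new AggregateException("No parser could read the file.", failures);
    }
}
```

The list would be ordered as in the text, with the ExcelDataReader-based delegate first and the NPOI-based one second. Treating an empty result as a failure covers the case where a parser opens the file but reports no rows.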

Epplus generate invalid document after using condition

I'm using EPPlus to generate an xlsx file. Everything was working well until I added this code:
var address = new ExcelAddress("G2:G5");
var condition = ws.ConditionalFormatting.AddExpression(address);
condition.Style.Font.Color.Color = Color.Red;
condition.Formula = string.Format("IF(G{0} < 25, 1, 0", 1);
Essentially, I'm trying to apply a different color to each cell, based on the value the cell contains.
The file is generated without errors, but when I open it, Excel says that the file is corrupted.
As you can see, I used G2:G5 as the address, but I also need to know how to cover the columns G to Y. The number of rows varies, so I don't know the exact number to specify.
Does anyone know what the problem is? Thanks.
For starters, you need proper syntax. You are missing a closing parenthesis at the end of your formula.
|
V
condition.Formula = string.Format("IF(G{0} < 25, 1, 0)", 1);
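For the second part of the question (columns G to Y with a variable number of rows), the address can be built at run time. The sketch below is an assumption-laden illustration, not EPPlus code: `ConditionalRange` and its methods are hypothetical helpers that only build strings. It assumes the last row is known (for example from `ws.Dimension.End.Row`) and that the formula is written for the top-left cell of the range with relative references, as in Excel's own conditional formatting dialog.

```csharp
using System;

public static class ConditionalRange
{
    // Builds an A1-style address such as "G2:Y40".
    public static string BuildAddress(string firstCol, string lastCol, int firstRow, int lastRow)
    {
        if (lastRow < firstRow)
            throw new ArgumentOutOfRangeException(nameof(lastRow), "lastRow must not be less than firstRow.");
        return firstCol + firstRow + ":" + lastCol + lastRow;
    }

    // Builds the condition for the top-left cell, e.g. "G2<25".
    public static string BuildFormula(string firstCol, int firstRow, int threshold)
    {
        return firstCol + firstRow + "<" + threshold;
    }
}
```

The resulting strings would be passed to `new ExcelAddress(...)` and `condition.Formula` in place of the hard-coded values.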

ExcelWorksheet.UsedRange is counting wrong if file has empty rows on top

I have a C# Windows application that reads the contents of files. I want to extract values from the used rows only.
I am using this code:
int rows = ExcelWorksheet.UsedRange.Rows.Count;
Everything works fine, except when there are empty rows at the top: then the count is incorrect.
- The file has no special characters, formulas or the like, just plain text.
- The application can read Excel xls and xlsx files with no issue if the file has no empty rows at the top.
Okay, now I've realized I was doing it all wrong. Of course it will not read all of my UsedRange.Rows, because my for loop always starts reading at the first row. So I use ((Microsoft.Office.Interop.Excel.Range)(ExcelWorksheet.UsedRange)).Row as the starting point for reading.
This code works:
int rows = ExcelWorksheet.UsedRange.Rows.Count;
int fRowIndex = ((Microsoft.Office.Interop.Excel.Range)(ExcelWorksheet.UsedRange)).Row;
int rowCycle = 1;
for (int rowcounter = fRowIndex; rowCycle <= rows; rowcounter++, rowCycle++)
{
//code for reading
}
Instead of reading Excel row by row, it is better to fetch the data into C# as a Range and handle it in one go, using
Sheet.UsedRange.get_Value()
for the whole UsedRange of the sheet. Whenever you'd like to get only a part of the UsedRange, do it as
Excel.Range cell1 = Sheet.Cells[r0, c0];
Excel.Range cell2 = Sheet.Cells[r1, c1];
Excel.Range rng = Sheet.Range[cell1, cell2];
var v = rng.get_Value();
You already know the size of v in C# memory from the corner cells: (r1 - r0 + 1) rows by (c1 - c0 + 1) columns.
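When a multi-cell range is read this way, Interop hands back a two-dimensional object array whose indices start at 1, not 0. A small sketch of walking such a block without hard-coding the bounds follows; `RangeValues`, `CountNonEmpty` and `MakeOneBased` are hypothetical names, and `MakeOneBased` only exists so the loop can be tried without Excel.

```csharp
using System;

public static class RangeValues
{
    // Counts non-empty cells in a 2-D block, whatever its lower bounds are.
    public static int CountNonEmpty(object[,] values)
    {
        var count = 0;
        for (var r = values.GetLowerBound(0); r <= values.GetUpperBound(0); r++)
            for (var c = values.GetLowerBound(1); c <= values.GetUpperBound(1); c++)
                if (values[r, c] != null && values[r, c].ToString().Length > 0)
                    count++;
        return count;
    }

    // Builds a 1-based block shaped like the one Interop returns, for trying the loop without Excel.
    public static object[,] MakeOneBased(int rows, int cols)
    {
        return (object[,])Array.CreateInstance(
            typeof(object), new[] { rows, cols }, new[] { 1, 1 });
    }
}
```

With Interop, the argument would be `(object[,])rng.get_Value()`. Note that a single-cell range returns the bare value instead of an array, so that case needs separate handling.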

Excel: Losing decimal separator when converting from strings to number

I am trying to read some values from several files and save them in a new .xlsx file with different grouping. I devised a very simple setup to test different formatting and behavior with null values. I always open just-created file in Excel to see outcome. So far no problem.
However in my test-case I can achieve either: A) save the test values as they are (strings) or B) force Excel to regard them as numbers with given format (good), but lose decimal separator (very bad & strange).
I have traced the problem to the last line of the code snippet below. The idea of the self-assignment comes from another post here on SO, but right now I am unable to find it.
If the line is commented out, the results match the string[,] contents, except that they are formatted as text (and Excel complains with a "number formatted as text" message). If I uncomment it, the values are treated as numbers but lose their decimal separators. The problem might also be that I am in the Czech Republic, where the decimal separator is a comma (,), which might confuse Excel. Moreover, reading the values into a double[,] contents from the start is out, since I need to indicate when a value is absent (with an empty cell). And double?[,] contents crashes Excel...
Have you come across this behavior before? I would like to 1) be able to indicate a missing value and 2) have the cell contents formatted as numbers, not text. How can I achieve this?
excelApp = new Excel.Application();
excelWorkBooks = excelApp.Workbooks;
excelWorkBook = excelWorkBooks.Add();
excelSheets = excelWorkBook.Sheets;
excelWorkSheet = excelSheets[1]; //Beware! Excel is one-based as opposed to a zero-based C#
string[,] contents = new string[,] { { "1,23", "2,123123123", "3,1415926535" }, { "2,15", null, "" } };
int contentsHeight = contents.GetLength(0);
int contentsWidth = contents.GetLength(1);
System.Globalization.CultureInfo currentCulture = System.Threading.Thread.CurrentThread.CurrentCulture;
string numberFormat = string.Format("0" + currentCulture.NumberFormat.NumberDecimalSeparator + "00E+00");
for (int column = 0; column < contentsWidth; column++) {
excelWorkSheet.Columns[column + 1].NumberFormat = numberFormat;
}
Excel.Range range = excelWorkSheet.Range[excelWorkSheet.Cells[1, 1], excelWorkSheet.Cells[contentsHeight, contentsWidth]];
range.Value = contents;
// range.Value = range.Value; //Problematic place
EDIT: I tried changing NumberFormat from 0,00E+00 to something like 0,0, 0.0 or #,# as a test, but with no success. It either crashes (decimal dot) or the values remain text.
There's no need to convert numbers to text before writing them to a cell. Excel understands numbers. A further problem is that the code is trying to set the array as the value of an entire range, as if pasting into Excel.
It's possible to set numbers, even nulls, directly using a simple loop, eg
double?[,] contents = new double?[,] { { 1.23, 2.123123123, 3.1415926535 },
{ 2.15, null, null } };
int contentsHeight = contents.GetLength(0);
int contentsWidth = contents.GetLength(1);
...
for (int i = 0; i < contentsHeight; i++)
for (int j = 0; j < contentsWidth; j++)
excelWorkSheet.Cells[i+1,j+1].Value = contents[i,j];
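If the source values only exist as strings with a comma as decimal separator, as in the question, they can be parsed once before being written. The sketch below is an illustration under assumptions: `CellValues` and `ToCellValues` are hypothetical names, and the comma separator is fixed explicitly instead of being taken from the current culture. Parseable text becomes a boxed double and everything else stays null, which is meant to end up as an empty cell (worth verifying against the target library).

```csharp
using System.Globalization;

public static class CellValues
{
    // Assumption: input uses ',' as the decimal separator, as in the question's sample data.
    private static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = " "
    };

    // Returns a double for parseable text and null for missing or empty values.
    public static object[,] ToCellValues(string[,] contents)
    {
        var rows = contents.GetLength(0);
        var cols = contents.GetLength(1);
        var result = new object[rows, cols];
        for (var i = 0; i < rows; i++)
        {
            for (var j = 0; j < cols; j++)
            {
                double parsed;
                if (double.TryParse(contents[i, j], NumberStyles.Float, CommaDecimal, out parsed))
                    result[i, j] = parsed;   // boxed double: a numeric value
                // otherwise the slot stays null to mark a missing value
            }
        }
        return result;
    }
}
```

The per-cell loop above can then assign `result[i, j]` instead of `contents[i, j]`.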
Instead of using Excel through Interop though, it's better to use a package like EPPlus to generate xlsx files directly without having Excel installed. This allows generating real Excel files even on web servers, where installing Excel is impossible.
The code for this particular problem would be similar:
var file = new FileInfo("test.xlsx");
using (var pck = new ExcelPackage(file))
{
var ws = pck.Workbook.Worksheets.Add("Rules");
for (int i = 0; i < contentsHeight; i++)
for (int j = 0; j < contentsWidth; j++)
ws.Cells[i+1,j+1].Value = contents[i,j];
pck.Save();
}
EPPlus has some convenience methods that make loading a sheet easy, eg LoadFromDataTable or LoadFromCollection. If the data came from a DataTable, creating the sheet would be as simple as:
var file = new FileInfo("test.xlsx");
using (var pck = new ExcelPackage(file))
{
var ws = pck.Workbook.Worksheets.Add("Rules");
ws.Cells["A1"].LoadFromDataTable(myTable, true); // true = write column headers
pck.Save();
}
LoadFromDataTable returns an ExcelRange which allows cell formatting just like Excel Interop.

Excel - Getting cell formatting is slow

I'm using C# to pull data from an Excel file. I need to get the text and some minor formatting data from a sheet. My test sheet has 115 rows and 10 columns. The performance seems sluggish. If I only pull out the text using the code below it takes about 2 seconds to run. If I check the font (in the if(c.Font.Bold==null..... line) it goes up to 8 seconds. If I get the borders info then it goes up to 17 seconds.
The problem is that I'll have many, many sheets I need to pull data from and speed will become an issue. Any suggestions on what I can do to speed this up? I really appreciate any help.
foreach (Range c in oSheet.UsedRange.Cells)
{
var txt = c.Text;
if (c.Font.Bold == null || c.Font.Italic == null || Convert.ToInt32(c.Font.Underline) > 0 || Convert.ToBoolean(c.Font.Bold) || Convert.ToBoolean(c.Font.Italic))
txt = "";
var borderBottom = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeBottom].LineStyle;
var borderTop = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeTop].LineStyle;
var borderLeft = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeLeft].LineStyle;
var borderRight = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeRight].LineStyle;
}
If your Excel file is an Excel 2007/2010 file (.xlsx), you can use the ExcelPackage or EPPlus components to read it. They are much faster than Office Interop.
I used EPPlus and it iterated over 2000 cells almost instantly!
ExcelPackage ep = new ExcelPackage(new FileStream(path, FileMode.Open, FileAccess.Read));
var sheet = ep.Workbook.Worksheets[1];
foreach (var cell in sheet.Cells[sheet.Dimension.Address])
{
var txt = cell.Text;
var font = cell.Style.Font;
if (font.Bold || font.Italic || font.UnderLine)
txt = "";
var borderBottom = cell.Style.Border.Bottom.Style;
var borderTop = cell.Style.Border.Top.Style;
var borderLeft = cell.Style.Border.Left.Style;
var borderRight = cell.Style.Border.Right.Style;
// ...
}
I'm not at all familiar with C#, but in VBA I set the Application.ScreenUpdating property to false at the start and back to true when finished. In general, this dramatically increases speed, especially if the macro performs any visible sheet updates.
I'm pretty sure this property is available in C# as well. Hope that helps.
You could use the steps below. This is very fast and takes one line of code (no need for loops at all). I am using a simple Excel sheet to explain it here:
Before
I stored the range A1:C4 dynamically in the variable exRange and used the code below to apply a border:
((Range)excelSheet.get_Range(exRange)).Cells.Borders.LineStyle = XlLineStyle.xlContinuous;
After
