Excel - Getting cell formatting is slow - C#

I'm using C# to pull data from an Excel file. I need to get the text and some minor formatting data from a sheet. My test sheet has 115 rows and 10 columns, and performance seems sluggish. If I only pull out the text using the code below, it takes about 2 seconds to run. If I also check the font (the if (c.Font.Bold == null... line), it goes up to 8 seconds. If I get the borders info as well, it goes up to 17 seconds.
The problem is that I'll have many, many sheets I need to pull data from and speed will become an issue. Any suggestions on what I can do to speed this up? I really appreciate any help.
foreach (Range c in oSheet.UsedRange.Cells)
{
    var txt = c.Text;
    if (c.Font.Bold == null || c.Font.Italic == null || Convert.ToInt32(c.Font.Underline) > 0 || Convert.ToBoolean(c.Font.Bold) || Convert.ToBoolean(c.Font.Italic))
        txt = "";
    var borderBottom = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeBottom].LineStyle;
    var borderTop = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeTop].LineStyle;
    var borderLeft = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeLeft].LineStyle;
    var borderRight = c.Borders.Item[Microsoft.Office.Interop.Excel.XlBordersIndex.xlEdgeRight].LineStyle;
}

If your Excel file is an Excel 2007/2010 file (.xlsx), you can use the ExcelPackage or EPPlus components to read it. They are much faster than Office interop.
I used EPPlus and it iterated over 2000 cells almost instantly!
using (var ep = new ExcelPackage(new FileStream(path, FileMode.Open, FileAccess.Read)))
{
    var sheet = ep.Workbook.Worksheets[1];
    foreach (var cell in sheet.Cells[sheet.Dimension.Address])
    {
        var txt = cell.Text;
        var font = cell.Style.Font;
        if (font.Bold || font.Italic || font.UnderLine)
            txt = "";
        var borderBottom = cell.Style.Border.Bottom.Style;
        var borderTop = cell.Style.Border.Top.Style;
        var borderLeft = cell.Style.Border.Left.Style;
        var borderRight = cell.Style.Border.Right.Style;
        // ...
    }
}

I'm not at all familiar with C#, but in VBA I set the Application.ScreenUpdating property to false at the start and back to true when finished. In the general case this dramatically increases speed, especially if the macro performs any visible sheet updates.
I'm pretty sure the same property is available in C# as well. Hope that was helpful)
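For what it's worth, the same property does exist on the interop Application object in C#. A minimal sketch (excelApp is a hypothetical name for the running Microsoft.Office.Interop.Excel.Application instance, which is not shown in the question's code; the try/finally just guarantees the flag is restored):
excelApp.ScreenUpdating = false; // suppress screen repaints while reading/writing
try
{
    // ... iterate cells here ...
}
finally
{
    excelApp.ScreenUpdating = true; // always restore, even if an error occurs
}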

You could use the steps below. This is very fast and one-line code (no need for loops at all). I am using a simple sheet to explain here:
Before
I stored the range (A1:C4 here, determined dynamically) in the variable exRange and used the code below to apply the border:
((Range)excelSheet.get_Range(exRange)).Cells.Borders.LineStyle = XlLineStyle.xlContinuous;
After
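As a hedged aside, exRange need not be hard-coded; it can be derived from the sheet itself, e.g. from the used range (a sketch; excelSheet is assumed to be the Worksheet from the snippet above):
string exRange = excelSheet.UsedRange.get_Address(); // e.g. "$A$1:$C$4"
((Range)excelSheet.get_Range(exRange)).Cells.Borders.LineStyle = XlLineStyle.xlContinuous;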

Validating a column with an Excel formula increments the formula - is this expected behavior or a bug?

I'm trying to create a spreadsheet where the first sheet ("Catalog") contains some pre-filled and some empty values in a column. I want the values to be in a drop-down list restricted to the values found in the second sheet ("Products").
I would expect that if I set the Excel validation formula for cells "A1:A1048576" in the "Catalog" sheet to be a list validation of "Products!A1:A100", every cell would only allow values from "Products!A1:A100". However, I'm finding that my formula gets incremented for every row in the "Catalog" sheet (i.e. in row 2 the formula becomes "Products!A2:A101", in row 3 it becomes "Products!A3:A102").
If version matters, I'm using EPPlus.Core v1.5.4 from NuGet.
I'm not sure if this is a bug or if I'm applying my formula wrong.
I've already tried directly applying the validation to every cell in the column, one cell at a time. Not only does it moderately increase the size of the resulting Excel file, but more importantly it drastically increases the time taken to generate it: even applying the validation one cell at a time on just the first 2000 rows more than doubles the generation time.
ExcelPackage package = new ExcelPackage();
int catalogProductCount = 10;
int productCount = 100;
var catalogWorksheet = package.Workbook.Worksheets.Add("Catalog");
for (int i = 1; i <= catalogProductCount; i++)
{
    catalogWorksheet.Cells[i, 1].Value = $"Product {i}";
}
var productsWorksheet = package.Workbook.Worksheets.Add("Products");
for (int i = 1; i <= productCount; i++)
{
    productsWorksheet.Cells[i, 1].Value = $"Product {i}";
}
var productValidation = catalogWorksheet.DataValidations.AddListValidation("A1:A1048576");
productValidation.ErrorStyle = ExcelDataValidationWarningStyle.stop;
productValidation.ErrorTitle = "An invalid product was entered";
productValidation.Error = "Select a product from the list";
productValidation.ShowErrorMessage = true;
productValidation.Formula.ExcelFormula = $"Products!A1:A{productCount}";
I guess I'm not that adept at Excel formulas.
Changing this line:
productValidation.Formula.ExcelFormula = $"Products!A1:A{productCount}";
to this:
productValidation.Formula.ExcelFormula = $"Products!$A$1:$A${productCount}";
stopped the auto-increment issue. The $ signs make the reference absolute, so Excel no longer shifts it relative to each row the validation applies to. Hopefully this answer will save someone else some sanity, as I wasted half a day on this issue myself.

NPOI issue with FirstRowNum and LastRowNum returning -1 for a sheet that has rows

When trying to get the rows from an Excel sheet using NPOI, FirstRowNum and LastRowNum return -1.
IWorkbook workbook = null;
List<ImportedKPI> excelRows = new List<ImportedKPI>();
MemoryStream ms = new MemoryStream(array);
ISheet sheet = null;
workbook = WorkbookFactory.Create(ms);
sheet = workbook.GetSheet(mapping.Sheet);
//Do some stuff here and try to get rows
for (int i = sheet.FirstRowNum; i <= sheet.LastRowNum; i++)
//sheet.FirstRowNum = -1 && sheet.LastRowNum = -1
Also worth mentioning: when I open the file in LibreOffice and close it, it asks whether I want to save, and after saving it the code works.
When comparing the byte arrays before/after that save, they are different.
Turns out not all Excel parsers work for all file types.
And since I receive the Excel files by email, I had to implement a chain-of-responsibility pattern based on the NPOI and ExcelDataReader NuGet packages.
This means that when a file comes in, I first try to parse it with one class (ExcelDataReader first), and if that's not successful it moves on to NPOI and tries to parse it with that.
This seems like the better way when there are many sources.
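A minimal sketch of that fallback chain (the Try* helper names are hypothetical; ExcelReaderFactory.CreateReader and WorkbookFactory.Create are the two libraries' real entry points):
public static bool TryParseWithExcelDataReader(System.IO.Stream stream)
{
    try
    {
        using (var reader = ExcelDataReader.ExcelReaderFactory.CreateReader(stream))
        {
            while (reader.Read()) { /* map the current row here */ }
        }
        return true;
    }
    catch { return false; } // swallow and let the next parser in the chain try
}

public static bool TryParseWithNpoi(System.IO.Stream stream)
{
    try
    {
        var workbook = NPOI.SS.UserModel.WorkbookFactory.Create(stream);
        /* iterate sheets/rows here */
        return true;
    }
    catch { return false; }
}

// Usage: run the parsers in order until one succeeds,
// rewinding the stream between attempts:
// if (!TryParseWithExcelDataReader(ms)) { ms.Position = 0; TryParseWithNpoi(ms); }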

Excel: Losing decimal separator when converting from strings to number

I am trying to read some values from several files and save them in a new .xlsx file with a different grouping. I devised a very simple setup to test different formatting and the behavior with null values. I always open the just-created file in Excel to see the outcome. So far, no problem.
However, in my test case I can achieve either: A) saving the test values as they are (strings), or B) forcing Excel to regard them as numbers with a given format (good), but losing the decimal separator (very bad & strange).
I traced the problem to the last line in the code snippet below. The idea of the self-assignment comes from another post somewhere here on SO, but right now I am unable to find it.
If the line is commented out, the results are as in string[,] contents, only formatted as text (and Excel complains about this with a "number formatted as text" message). If I uncomment it, the values are regarded as numbers but lose their decimal separators. Part of the problem might be that I am in the Czech Republic, where the decimal separator is ",", which might trouble Excel. Moreover, reading the values from the start into a double[,] contents is out, since I need to indicate when a value is absent (with an empty cell). And double?[,] contents crashes Excel...
Haven't you met this behavior before? I would like to 1) be able to indicate a missing value and 2) have the contents of cells formatted as numbers, not text. Can you help me achieve this?
excelApp = new Excel.Application();
excelWorkBooks = excelApp.Workbooks;
excelWorkBook = excelWorkBooks.Add();
excelSheets = excelWorkBook.Sheets;
excelWorkSheet = excelSheets[1]; //Beware! Excel is one-based, as opposed to zero-based C#
string[,] contents = new string[,] { { "1,23", "2,123123123", "3,1415926535" }, { "2,15", null, "" } };
int contentsHeight = contents.GetLength(0);
int contentsWidth = contents.GetLength(1);
System.Globalization.CultureInfo currentCulture = System.Threading.Thread.CurrentThread.CurrentCulture;
string numberFormat = string.Format("0" + currentCulture.NumberFormat.NumberDecimalSeparator + "00E+00");
for (int column = 0; column < contentsWidth; column++) {
    excelWorkSheet.Columns[column + 1].NumberFormat = numberFormat;
}
Excel.Range range = excelWorkSheet.Range[excelWorkSheet.Cells[1, 1], excelWorkSheet.Cells[contentsHeight, contentsWidth]];
range.Value = contents;
// range.Value = range.Value; //Problematic place
EDIT: I tried changing NumberFormat from 0,00E+00 to something like 0,0, 0.0, #,# for the sake of testing, but with no success. Either a crash (decimal dot) or the values remain text.
There's no need to convert numbers to text before writing them to a cell - Excel understands numbers. A further problem is that the code is trying to set the array as the value of an entire range, as if pasting into Excel.
It's possible to set numbers, even nulls, directly using a simple loop, e.g.:
double?[,] contents = new double?[,] { { 1.23, 2.123123123, 3.1415926535 },
                                       { 2.15, null, null } };
int contentsHeight = contents.GetLength(0);
int contentsWidth = contents.GetLength(1);
...
for (int i = 0; i < contentsHeight; i++)
    for (int j = 0; j < contentsWidth; j++)
        excelWorkSheet.Cells[i + 1, j + 1].Value = contents[i, j];
Instead of using Excel through Interop though, it's better to use a package like EPPlus to generate xlsx files directly without having Excel installed. This allows generating real Excel files even on web servers, where installing Excel is impossible.
The code for this particular problem would be similar:
var file = new FileInfo("test.xlsx");
using (var pck = new ExcelPackage(file))
{
    var ws = pck.Workbook.Worksheets.Add("Rules");
    for (int i = 0; i < contentsHeight; i++)
        for (int j = 0; j < contentsWidth; j++)
            ws.Cells[i + 1, j + 1].Value = contents[i, j];
    pck.Save();
}
EPPlus has some convenience methods that make loading a sheet easy, eg LoadFromDataTable or LoadFromCollection. If the data came from a DataTable, creating the sheet would be as simple as:
var file = new FileInfo("test.xlsx");
using (var pck = new ExcelPackage(file))
{
    var ws = pck.Workbook.Worksheets.Add("Rules");
    ws.LoadFromDataTable(myTable);
    pck.Save();
}
LoadFromDataTable returns an ExcelRange which allows cell formatting just like Excel Interop.
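For instance, the returned range can be formatted in one go (a sketch; myTable and the particular number format are assumptions):
var range = ws.LoadFromDataTable(myTable, true); // true = include column headers
range.Style.Numberformat.Format = "0.00E+00";    // apply a number format to the whole range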

Excel Interop Open/Repair HResult exception

What I do: populate & format an Excel file using a mix of Interop and ClosedXML.
First the file is populated via Interop, saved, and closed; then I format the cells' RichText using ClosedXML.
Unfortunately, this formatting causes Excel to consider my file "corrupt" and it has to repair it.
This is the relevant part:
var workbook = new XLWorkbook(xlsPath);
var sheet = workbook.Worksheet("Error Log");
for (var rownum = 2; rownum <= 10000; rownum++)
{
    var oldcell = sheet.Cell("C" + rownum);
    var newcell = sheet.Cell("D" + rownum);
    var oldtext = oldcell.GetFormattedString();
    if (string.IsNullOrEmpty(oldtext.Trim()))
        break;
    XlHelper.ColorCellText(oldcell, "del", System.Drawing.Color.Red);
    XlHelper.ColorCellText(newcell, "add", System.Drawing.Color.Green);
}
workbook.Save();
workbook.Save();
And the colouring method:
public static void ColorCellText(IXLCell cel, string tagName, System.Drawing.Color col)
{
    // Strip <g id="..."> wrapper tags left over in the cell text
    var rex = new Regex("\\<g\\sid\\=[\\sa-z0-9\\.\\:\\=\\\"]+?\\>");
    var txt = cel.GetFormattedString();
    var mc = rex.Matches(txt);
    var xlcol = XLColor.FromColor(col);
    foreach (Match m in mc)
    {
        txt = txt.Replace(m.Value, "");
        txt = txt.Replace("</g>", "");
    }
    var startTag = string.Format("[{0}]", tagName); // e.g. "[del]"
    var endTag = string.Format("[/{0}]", tagName);  // e.g. "[/del]"
    var crt = cel.RichText;
    crt.ClearText();
    // Note: the offsets 5 and 6 below assume a 3-character tag name like "del" or "add"
    while (txt.Contains(startTag) || txt.Contains(endTag))
    {
        var pos1 = txt.IndexOf(startTag);
        if (pos1 == -1)
            pos1 = 0;
        var pos2 = txt.IndexOf(endTag);
        if (pos2 == -1)
            pos2 = txt.Length - 1;
        var txtLen = pos2 - pos1 - 5;        // length of the tagged text
        crt.AddText(txt.Substring(0, pos1)); // text before the tag keeps the default colour
        crt.AddText(txt.Substring(pos1 + 5, txtLen)).SetFontColor(xlcol); // tagged text gets coloured
        txt = txt.Substring(pos2 + 6);       // continue after the end tag
    }
    if (!string.IsNullOrEmpty(txt))
        crt.AddText(txt); // any remaining untagged text
}
Error in file myfile.xlsx
The following repairs were performed: _x000d__x000a__x000d__x000a_
Repaired records:
string properties of /xl/sharedStrings.xml-Part (strings)
I've been through all the XML parts looking for clues. In the affected sheet, in the comparison view of the Open XML SDK Productivity Tool, some blocks appear as inserted in the repaired file and deleted in the corrupt one, although nothing significant seemed changed - except for one thing: the style attribute of that cell. Here is an example:
<x:c r="AA2" s="59">
<x:f>
(IFERROR(VLOOKUP(G2,Legende!$A$42:$B$45,2,FALSE),0))
</x:f>
</x:c>
I have checked the styles.xml for style 59, but there is none. In the repaired file, this style has been changed to 14, which in my styles.xml is listed as a number format.
Unfortunately, a global search/replace of these invalid style indexes did not resolve the issue.
Seeing the things going on here with corrupt indexes, renamed XML parts, invalid named ranges etc., I took a different route and dropped Interop entirely - maybe the corruption was caused by Excel in the first place and the colouring was only the last straw.
Using ClosedXml only:
Wow. Just wow. This makes it even worse. I commented out the colouring part, since without it Interop produced a readable file without errors, so that's what I expect of ClosedXml too.
This is how I open the file and address the worksheet with ClosedXml:
var wb= new XLWorkbook(xlsPath);
var errors = wb.Worksheet("Error Log");
This is how I write the values into the file:
errors.Cell(zeile, 1).SetValue(fname);
With zeile being a simple int counter.
I then dare to set a column width:
errors.Column(2).Width = 50;
errors.Column(3).Width = 50;
errors.Column(4).Width = 50;
As well as setting some values in another sheet in exactly the same fashion before saving with validation.
wb.Save(true);
wb.Dispose();
Lo and behold: The validation throws errors:
Attribute 'name' should have unique value. Its current value 'Legende' duplicates with others.
Attribute 'sheetId' should have unique value. Its current value '4' duplicates with others.
There are a couple more errors like these, e.g. attribute 'top' having the invalid value '11.425781'.
Excel cannot open the file directly and must repair it. My sheet "Legende" is now empty and is the first sheet instead of the third, and I get an additional fourth sheet "Restored_Table1" which contains my original "Legende" contents.
What the hell is going on with this file??
New attempt: re-create the Excel template from scratch - in LibreOffice.
I now think the issue is entirely misleading. If I use the newly created file from LibreOffice, the validation causes a System.OutOfMemoryException due to too many validation errors. Opening it in Excel requires a repair, gives an additional sheet, and so forth.
Creating the file in LibreOffice, then opening it in Excel and saving, then using that file as the template produces a much better result, albeit not perfect yet.
Since I copied parts over from the old Excel file into LO while creating the new file, I assume some corrupt remnant got copied over.
I cannot shake the feeling that this is the file itself after all and has nothing to do with how I edit it!
Will post an update tomorrow.
OK. Stuff this.
I created a completely fresh file with LibreOffice, making sure not to copy over anything at all from the original file, and I ditched Interop in favour of ClosedXml.
=> This produced a corrupt file in which my first sheet was cleared and its contents moved to a "Restored_Table1" sheet.
After I opened my fresh new template with Excel via Open/Repair and saved it, the resulting, uncoloured file was NOT corrupt.
=> Colouring it produces the "original" corruption, all sheets intact.
ClosedXml seems to be marginally slower than Interop but at this point I couldn't care less. I guess we will have to live with the "corrupt" message and just get on with it.
I hate xlsx.

Merging CSV lines in a huge file

I have a CSV that looks like this
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
although there are 5 billion records. If you look at the first column and the day part of the 2nd column, you'll see that three of the records are 'grouped' together: they are just a breakdown into 15-minute intervals of the first 30 minutes of that day.
I want the output to look like
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
Where the first 4 columns of the repeating rows are omitted and the rest of the columns are combined with the first record of its kind. Basically I am converting the data from one line per 15 minutes to one line per day.
Since I will be processing 5 billion records, I think the best approach is regular expressions (and EmEditor) or some tool that is made for this (multithreaded, optimized), rather than a custom programmed solution. Although I am open to ideas in Node.js or C# that are relatively simple and super quick.
How can this be done?
If there's always a set number of records and they're in order, it'd be fairly easy to just read a few lines at a time, parse them, and output them. Trying to run a regex over billions of records would take forever. Using StreamReader and StreamWriter should make it possible to read and write these large files, since they read and write one line at a time.
using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    int counter = 0;
    var lineCountToGroup = 3; //change to 96
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) //less 1 because we already added line1
            lines.Add(sr.ReadLine());
        var groupedLine = lines.SomeLinqIfNecessary(); //whatever your grouping logic is
        sw.WriteLine(groupedLine);
    }
}
Disclaimer: untested code with no error handling, and assuming that there really are the correct number of lines repeated, etc. You'd obviously need to make some tweaks for your exact scenario; one possible shape of the grouping logic is sketched below.
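For what it's worth, a hedged example of what the SomeLinqIfNecessary placeholder might look like for the format in the question (keep the first line whole and append columns 5+ of each following line; assumes well-formed input and using System.Linq):
var groupedLine = lines[0] + "," + string.Join(",",
    lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4))));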
You could do something like this (untested code without much error handling - but it should give you the general gist of it):
using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine(); // note: should add error handling for empty files
    var cells = line.Split(','); // note: you should probably check the length too!
    var key = cells[0]; // use this to match other rows
    StringBuilder output = new StringBuilder(line); // this is the output line we build
    while ((line = sin.ReadLine()) != null) // while we have more lines
    {
        cells = line.Split(','); // split so we can get the first column
        if (cells[0] == key) // if the first column matches the current key
        {
            output.Append(',').Append(string.Join(",", cells.Skip(4))); // append this row, minus its first 4 columns
            continue;
        }
        // once the key changes
        sout.WriteLine(output.ToString()); // write out the line we've built up
        output.Clear();
        output.Append(line); // start building the new line
        key = cells[0]; // and update the key
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString()); // we'll have just the last line to write out
}
The idea is to loop through each line in turn and keep track of the current value of the first column. When that value changes, you write out the output line you've been building up and update the key. This way you don't have to worry about exactly how many matches you have or whether you might be missing a few data points.
One note: since you are going to concatenate up to 96 rows per output line, a StringBuilder is much more efficient than repeated string concatenation, which is why the code above uses one.
Define ProcessOutputLine to store the merged lines.
Call ProcessInputLine after each ReadLine, and once more at the end of the file.
string curKey = "";
int keyLength = ...; // set to the total length of the first 4 columns
string outputLine = "";
private void ProcessInputLine(string line)
{
    string newKey = line.Substring(0, keyLength);
    if (newKey == curKey)
    {
        outputLine += line.Substring(keyLength);
    }
    else
    {
        if (outputLine != "") ProcessOutputLine(outputLine);
        curKey = newKey;
        outputLine = line;
    }
}
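A hedged sketch of the driver loop described above (file names are placeholders):
using (var sr = new StreamReader("inputFile.csv"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
        ProcessInputLine(line);
    if (outputLine != "") ProcessOutputLine(outputLine); // flush the final group at end of file
}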
EDIT: this solution is very similar to Matt Burland's; the only noticeable difference is that I don't use the Split function.
