C# IronXL (Excel) + LINQ memory issue

My goal is to find all the cells in an Excel workbook containing a specific text. The workbook is quite large (about 2 MB) and has about 22 sheets. Historically we had problems with Interop, so I found IronXL, and I love the way it operates.
The problem is that at some point RAM usage climbs above 2 GB, and of course everything becomes very slow.
I'm aware of the materialization issue, so I'm trying to avoid ToList() or Count() when using LINQ.
The first "problem" I found with IronXL is that the Cell class doesn't have any field specifying the name of the sheet it belongs to, so I split the code into 2 steps:
A LINQ query to find all the cells containing the text
Then I iterate over those results to store the desired cell info plus the sheet name where it was found in my custom class MyCell
The custom class:
class MyCell
{
    public int X;
    public int Y;
    public string Location;
    public string SheetName;

    public MyCell(int x, int y, string location, string sheetName)
    {
        X = x;
        Y = y;
        Location = location;
        SheetName = sheetName;
    }
}
Here is my code:
List<MyCell> FindInExcel(WorkBook wb, string textToFind)
{
    List<MyCell> res = new List<MyCell>();
    var cells = from sheet in wb.WorkSheets
                from cell in sheet
                where cell.IsText && cell.Text.Contains(textToFind)
                select new { cell, sheet };
    foreach (var cell in cells)
    {
        res.Add(new MyCell(cell.cell.ColumnIndex, cell.cell.RowIndex, cell.cell.Location, cell.sheet.Name));
    }
    return res;
}
To test my method, I call:
WorkBook excel = WorkBook.Load("myFile.xlsx");
var results = FindInExcel(excel, "myText");
What happens when I execute and debug the code is very weird. The LINQ query itself executes very fast, and in my case I get 2 results. Then the foreach starts iterating, and the first 2 items are added to the list quickly, so everything looks perfect. But on the 3rd pass, when it checks whether another item is available, memory jumps to 2 GB and the check takes about 10 seconds.
I observed the same behaviour when I do this:
int count = cells.Count();
I'm aware this is materializing the results, but what I don't understand is why I get the first 2 results in the foreach so fast, and it's only the last step where memory increases.
Seeing this behavior, it seems the code somehow knows how many items it has found without me calling Count(); otherwise it would already be slow the first time the foreach runs.
Just to check whether I was going crazy, I tried putting this small snippet in the FindInExcel method:
int cnt = 0;
foreach (var cell in cells)
{
    res.Add(new MyCell(cell.cell.ColumnIndex, cell.cell.RowIndex, cell.cell.Location, cell.sheet.Name));
    cnt++;
    if (cnt == 2)
        break;
}
In this last case I don't hit the memory problem, and I finally get a list with the 2 cells I want.
What am I missing? Is there any way to do what I'm trying to do without materializing the results? I even tried moving to .NET Framework 4.8.1 to see if some bug had been fixed, but I get the same behavior.
Note: If I use this code on a small Excel file, it runs very fast.
Thank you in advance!

I already found the issue. There was a sheet where a hidden formula had been extended down to the last cell (M1048576), so the query was searching for the value in all of those cells. Once I removed it, the memory issue is gone.
Thank you guys!
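In hindsight the foreach behaviour makes sense: with deferred execution the query never knows how many matches exist; each MoveNext just keeps scanning cells until the next match, so the final iteration had to crawl through the million-plus cells of that broken sheet before it could give up. If you can't clean the workbook, a defensive variant of FindInExcel can at least skip cells past a sane row limit. This is only a sketch built on the same Cell properties used above (RowIndex, IsText, Text); IronXL still enumerates the cells, so it may not remove the cost entirely, and fixing the sheet remains the real solution.

// Sketch only: same query as above, but ignoring anything below an arbitrary row limit.
List<MyCell> FindInExcel(WorkBook wb, string textToFind, int maxRow = 10000)
{
    List<MyCell> res = new List<MyCell>();
    var cells = from sheet in wb.WorkSheets
                from cell in sheet
                where cell.RowIndex < maxRow       // skip runaway rows like M1048576
                      && cell.IsText
                      && cell.Text.Contains(textToFind)
                select new { cell, sheet };
    foreach (var hit in cells)
    {
        res.Add(new MyCell(hit.cell.ColumnIndex, hit.cell.RowIndex, hit.cell.Location, hit.sheet.Name));
    }
    return res;
}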

Related

C# Excel OutOfMemory exception caused by inserting to a range

I've been working on a quite complex C# VSTO project that does a lot of different things in Excel. However, I have recently stumbled upon a problem I have no idea how to solve. I'm afraid that putting the whole project here will overcomplicate my question and confuse everyone so this is the part with the problem:
// this is a simplified version of the Range declaration, which I am 100% confident in
Range range = worksheet.Range[firstCell, lastCell];
range.Formula = array;
// where array is an object[,] which basically contains only strings and also works perfectly fine
The last line, which is supposed to insert a [,] array into an Excel range, used to work for smaller Excel books, but now crashes for bigger books with a System.OutOfMemoryException: Insufficient memory to continue the execution of the program, and I have no idea why. It used to work with arrays of 500+ elements in one of their dimensions, whereas now it crashes for an array with under 400 elements. Furthermore, the RAM usage is about 1.2 GB at the moment of the crash, and I know this project is capable of running perfectly fine with RAM usage of ~3 GB.
I have tried the following things: inserting this array row by row, then inserting it cell by cell, and calling GC.Collect() before each insertion of a row or a cell, but it would nonetheless crash with a System.OutOfMemoryException.
So I would appreciate any help in solving this problem or identifying where the error could possibly be hiding, because I can't wrap my head around why it refuses to work for arrays with smaller lengths (but maybe slightly bigger contents) at a RAM usage level of 1.2 GB, which is about 1/3 of what it used to handle. Thank you!
EDIT
I've been told in the comments that the code above might be too sparse, so here is a more detailed version (I hope it's not too confusing):
List<object[][]> controlsList = new List<object[][]>();
// this list is filled with a quite long method calling a lot of other functions
// if other parts look fine, I guess I'll have to investigate it

int totalRows = 1;
foreach (var control in controlsList)
{
    if (control.Length == 0)
        continue;
    var range = worksheet.GetRange(totalRows + 1, 1, totalRows += control.Length, 11);
    // control is an object[n][11] so normally there are no index issues with inserting
    range.Formula = control.To2dArray();
}
//GetRange and To2dArray are extension methods
public static Range GetRange(this Worksheet sheet, int firstRow, int firstColumn, int lastRow, int lastColumn)
{
    var firstCell = sheet.GetRange(firstRow, firstColumn);
    var lastCell = sheet.GetRange(lastRow, lastColumn);
    return (Range)sheet.Range[firstCell, lastCell];
}

public static Range GetRange(this Worksheet sheet, int row, int col) =>
    (Range)sheet.CheckIsPositive(row, col).Cells[row, col];

public static T CheckIsPositive<T>(this T returnedValue, params int[] vals)
{
    if (vals.Any(x => x <= 0))
        throw new ArgumentException("Values must be positive");
    return returnedValue;
}

public static T[,] To2dArray<T>(this T[][] source)
{
    if (source == null)
        throw new ArgumentNullException();
    int l1 = source.Length;
    int l2 = source[0].Length;
    T[,] result = new T[l1, l2];
    for (int i = 0; i < l1; ++i)
        for (int j = 0; j < l2; ++j)
            result[i, j] = source[i][j];
    return result;
}
I am not 100% sure I figured it out correctly, but it seems like the issue lies within Interop.Excel/Excel limitations and the length of the formulas I'm trying to insert: whenever the length approaches 8k characters, which is close to Excel's limit for formula contents, the System.OutOfMemoryException pops up. When I opted to leave the lengthy formulas out, the program started working fine.
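Based on that, a cheap guard is to check formula length before handing the array to Interop. The sketch below reuses the loop and helpers shown above; the 8,000-character cut-off is an assumption on my side (Excel's documented formula limit is 8,192 characters), and the over-long formulas still have to be written some other way:

// Sketch: skip suspiciously long formulas before the bulk insert.
const int MaxFormulaLength = 8000;   // assumption, just under Excel's 8,192-char limit

int totalRows = 1;
foreach (var control in controlsList)
{
    if (control.Length == 0)
        continue;

    // Blank out any formula that exceeds the limit; these cells need separate handling.
    foreach (object[] row in control)
    {
        for (int col = 0; col < row.Length; col++)
        {
            if (row[col] is string s && s.Length > MaxFormulaLength)
                row[col] = string.Empty;
        }
    }

    var range = worksheet.GetRange(totalRows + 1, 1, totalRows += control.Length, 11);
    range.Formula = control.To2dArray();
}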

Generating a variable number of random numbers to a list, then comparing those numbers to a target?

How would I go about generating a serializable variable of random numbers to a list, then comparing those generated numbers to a target number?
What I want to do is make a program that takes in a number, such as 42, and generates that number of random numbers to a list while still keeping the original variable, in this case 42, to be referenced later. Super ham-handed pseudo-code(?) example:
public class Generate {
    [SerializeField]
    int generate = 42;
    List<int> results = new List<int>();

    public void Example() {
        int useGenerate = generate;
        // Incoming pseudo-code (rather, code that I don't know how to do, exactly)
        while (useGenerate >= 1) {
            results.Add(Random.Range(0, 100)); // Does this make a number between 0 and 99?
            useGenerate = useGenerate - 1;
        }
    }
}
I think this will do something to that effect, once I figure out how to actually code it properly (Still learning).
From there, I'd like to compare the list of results to a target number, to see how many of them pass a certain threshold, in this case greater than or equal to 50. I assume this would require a "foreach" thingamabobber, but I'm not sure how to go about doing that, really. With each "success", I'd like to increment a variable to be returned at a later point. I guess something like this:
int success = 50;
int target = 0;
foreach (int result in results) {
    if (result >= success) {
        target = target + 1;
    }
}
If I have the right idea, please just teach me how to properly code it. If you have any suggestions on how to improve it (like the whole ++ and -- thing I see here and there but don't know how to use), please teach me that, too. I looked around the web for using foreach with lists and it seemed really complicated and people were seemingly pulling some new bit of information from the Aether to include in the operation. Thanks for reading, and thanks in advance for any advice!
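For reference, a minimal sketch of the idea described above in plain C# (System.Random here rather than Unity's Random; in both, the int overload's upper bound is exclusive, so this produces 0 to 99):

using System;
using System.Collections.Generic;

class RollCounter
{
    static readonly Random Rng = new Random();

    // Generate `count` random numbers in [0, 100) without touching the original variable.
    static List<int> GenerateResults(int count)
    {
        var results = new List<int>();
        for (int i = 0; i < count; i++)    // i++ is the "++ thing": shorthand for i = i + 1
        {
            results.Add(Rng.Next(0, 100)); // 0..99 inclusive
        }
        return results;
    }

    // Count how many results meet or beat the threshold.
    static int CountSuccesses(List<int> results, int threshold)
    {
        int successes = 0;
        foreach (int roll in results)
        {
            if (roll >= threshold)
                successes++;
        }
        return successes;
    }

    static void Main()
    {
        int generate = 42;                        // the original variable, kept intact
        List<int> results = GenerateResults(generate);
        int target = CountSuccesses(results, 50); // how many rolls are >= 50
        Console.WriteLine(target + " of " + generate + " rolls were successes.");
    }
}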

C# Sudoku puzzle solver gets stuck while trying to backtrack to avoid bad values

I'm trying to get better at understanding recursion and I'm working on a program that is supposed to solve Sudoku puzzles by iterating over each cell in the puzzle and trying a value out of a list of possible values. After it drops in a value, it checks if doing so has solved the puzzle. If not, it continues on adding values and checking if doing so has solved the puzzle. There's a problem, though - I can see in my app's GUI that it eventually gets confused or stuck and starts attempting the same value insertion over and over. For example, here's an example of some logging that shows it being stuck:
Possible values: F
Old string: 086251B7D93+++C4
New string: 086251B7D93F++C4
As you can see, it's looking to see what value could possibly go into the cell after the 3 without violating the row, column, or box constraints. The only possible value is F, so it places it in there. But then it does the same thing over and over, and randomly starts attempting to edit other cells in other lines even though there are still +'s in this first row. It'll try to edit a row a few rows down, and it'll swap between doing that and then attempting this same scenario with the F repeatedly until the app crashes because of what Visual Studio thinks is a ContextSwitchDeadlock error.
What have I missed in my algorithm? Here is my method that tries to fill in the Sudoku puzzle:
public bool SolveGrid(List<Row> rows, List<Box> boxes)
{
    textBox1.Clear();
    foreach (var row in rows)
    {
        textBox1.AppendText(row.content + "\n");
    }
    if (ValidatorAgent.IsGridSolved(rows, boxes))
    {
        textBox2.AppendText("Success!");
        successGrid = rows;
        return true;
    }
    foreach (var row in rows)
    {
        foreach (var cell in row.content)
        {
            if (cell.Equals('+'))
            {
                var possibleValues = availableValues.Where(v => !ValidatorAgent.AreGridConstraintsBreached(cell, v, row, rows, boxes)).ToList();
                if (!possibleValues.Any())
                {
                    return false;
                }
                Console.WriteLine("Possible values: ");
                possibleValues.ForEach(v => Console.Write(v.ToString()));
                List<char> values = new List<char>();
                possibleValues.ForEach(v => values.Add(v));
                foreach (var possibleValue in values)
                {
                    var oldContent = row.content;
                    var stringBuilder = new StringBuilder(row.content);
                    var indexToChange = row.content.IndexOf(cell);
                    stringBuilder[indexToChange] = possibleValue;
                    Console.WriteLine("Old string: " + oldContent);
                    Console.WriteLine("New string: " + stringBuilder.ToString());
                    var alteredPuzzle = BuilderAgent.BuildPuzzleCopy(rows);
                    alteredPuzzle.Single(r => r.yCoord == row.yCoord).content = stringBuilder.ToString();
                    var newBoxes = BuilderAgent.BuildBoxes(alteredPuzzle);
                    SolveGrid(alteredPuzzle, newBoxes);
                }
            }
        }
    }
    return false;
}
I am pretty confident that my IsGridSolved method works correctly because I've tested it on puzzle solutions and found that it reported whether or not a puzzle was already solved. I've also tested my AreGridConstraintsBreached method with unit tests and also by eyeballing the results of the code, so I'm fairly confident in that code. It basically checks whether or not a cell's potential value is redundant due to the relevant row, column, or box already having that value. In Sudoku, when you place a value you need to make sure that the value 1) doesn't already exist in the row 2) doesn't already exist in the column and 3) doesn't already exist in the small box surrounding the cell. I feel like I'm missing something obvious in the algorithm itself, and that it isn't backtracking properly. I've tried fiddling with the SolveGrid method so that it takes in a list of validValues that are based off of the availableValues, with some pruning so that failed values are pruned out after each recursive call to SolveGrid but that didn't do anything besides keep the puzzle from even getting as far as it currently does. Any tips would be appreciated.
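For comparison, here is a minimal sketch of the usual backtracking shape on a plain 9x9 int grid (0 = empty), with a simple IsValid standing in for the AreGridConstraintsBreached check. The structural difference from the SolveGrid above is that the recursive call's return value is checked: success propagates true immediately, and failure resets the cell and tries the next candidate instead of moving on to other cells:

// Sketch only: canonical recursive backtracking on a 9x9 grid, 0 = empty.
static bool Solve(int[,] grid)
{
    for (int r = 0; r < 9; r++)
    {
        for (int c = 0; c < 9; c++)
        {
            if (grid[r, c] != 0)
                continue;                 // already filled, move on

            for (int v = 1; v <= 9; v++)
            {
                if (!IsValid(grid, r, c, v))
                    continue;

                grid[r, c] = v;           // tentatively place the value
                if (Solve(grid))
                    return true;          // propagate success up the call stack
                grid[r, c] = 0;           // undo and try the next candidate
            }
            return false;                 // no candidate fits this cell: backtrack
        }
    }
    return true;                          // no empty cells left: solved
}

// Stand-in for the row/column/box checks (the question's AreGridConstraintsBreached).
static bool IsValid(int[,] grid, int row, int col, int value)
{
    for (int i = 0; i < 9; i++)
    {
        if (grid[row, i] == value || grid[i, col] == value)
            return false;
    }
    int br = row - row % 3, bc = col - col % 3;
    for (int r = br; r < br + 3; r++)
        for (int c = bc; c < bc + 3; c++)
            if (grid[r, c] == value)
                return false;
    return true;
}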

Excel Interop Coloring Cell Using Range

I need to color Excel cells in a fast manner. I found a similar method for writing to Excel cells, which for me is really fast, so I tried applying the same method to coloring the cells. Consider the following code:
xlRange = xlWorksheet.Range["A6", "AS" + dtSchedule.Rows.Count];
double[,] colorData = new double[dtSchedule.Rows.Count, dtSchedule.Columns.Count];
for (var row = 0; row < dtSchedule.Rows.Count; row++)
{
    for (var column = 0; column < dtSchedule.Columns.Count; column++)
    {
        if (column <= 3)
        {
            colorData[row, column] = GetLightColor2("#ffffff");
            continue;
        }
        if (dtSchedule.Rows[row][column].ToString() != "#000000" && !string.IsNullOrEmpty(dtSchedule.Rows[row][column].ToString()))
        {
            string[] schedule = dtSchedule.Rows[row][column].ToString().Split('/');
            string color = schedule[0].Trim();
            colorData[row, column] = GetLightColor2(color);
            continue;
        }
        colorData[row, column] = GetLightColor2("#000000");
    }
}
xlRange.Interior.Color = colorData;
This is the GetLightColor2 function:
private double GetLightColor2(string hex)
{
    return ColorTranslator.ToOle(ColorTranslator.FromHtml(hex));
}
When I ran the code, an error was thrown at
xlRange.Interior.Color = colorData;
With the following error:
System.Runtime.InteropServices.COMException (0x80020005): Type mismatch. (Exception from HRESULT: 0x80020005 (DISP_E_TYPEMISMATCH))
   at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
   at Microsoft.Office.Interop.Excel.Interior.set_Color(Object value)
I could not find any other workaround except coloring the cells by looping through each one, which is really slow. Or am I doing it the wrong way?
Thank you for your kind attention guys.
If your question is not about an Excel add-in, I would strongly recommend following Akhil R J's advice. This won't be the last big problem you run into with Interop; this technology is just one big problem and bug. If for some reason you cannot switch, I can tell you some things about your problem:
1) There is no way to do what you want using arrays. That is possible only for values and formulas.
2) Set Application.ScreenUpdating = false when you set colors or do any other operation in Excel. While it's off, Excel stops redrawing the screen and things go faster.
3) If many cells have the same color, use Application.Union to build a single range from the separate cells of that color, then set the color once on the whole combined range. It's only effective to merge up to about 50 cells at a time; if you take more, the merging operation takes too much time and stops paying off. Done this way it's pretty effective, around 5-10 times faster in my case.
4) There is another, more difficult way. I'm going to try it myself for the same problem (I have an add-in, so I cannot just start using OpenXML). Using Interop, you can copy the target range to the Windows clipboard. In the clipboard it is stored in many formats, including something OpenXML-like, so you can edit it there and paste it back, again using Interop. I think it's the fastest way, but it must be very time-consuming to write this code.
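A rough sketch of points 2) and 3) combined, assuming app is the Excel Application object and cellsByColor is some grouping (built elsewhere, e.g. from the dtSchedule loop above) of OLE colour values to the individual cell Ranges that should receive them; with older PIAs you may need Type.Missing for Union's optional arguments:

// Sketch: batch same-coloured cells into unioned ranges before setting Interior.Color.
app.ScreenUpdating = false;
try
{
    foreach (KeyValuePair<double, List<Range>> group in cellsByColor)
    {
        Range combined = null;
        int batched = 0;
        foreach (Range cell in group.Value)
        {
            combined = (combined == null) ? cell : app.Union(combined, cell);
            if (++batched == 50)               // unions get slow past ~50 cells
            {
                combined.Interior.Color = group.Key;
                combined = null;
                batched = 0;
            }
        }
        if (combined != null)
            combined.Interior.Color = group.Key;
    }
}
finally
{
    app.ScreenUpdating = true;
}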

Most efficient way to process a large csv in .NET

Forgive my noobiness but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:
1) how to process the rows in the CSV: should I read it in as a list/collection, use OLEDB, an embedded database, or something else?
2) how to find something efficiently in an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)
You don't give enough specifics for a concrete answer, but...
IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.
string sql = @"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using (OleDbConnection connection = new OleDbConnection(
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath +
    ";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))
IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.
IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.
Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if(Rows.ContainsKey(search)) ...
IF you want your search to be a partial match like StartsWith then have 1 array containing your searchable data (ie: first column) and another list or array containing your row data. Then use C#'s built in binary search http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx
string[] SortedSearchables = new string[0];
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if (foundIdx < 0) {
    foundIdx = ~foundIdx;
    if (foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
        result = SortedRows[foundIdx];
    }
} else {
    result = SortedRows[foundIdx];
}
NOTE code was written inside the browser window and may contain syntax errors as it wasn't tested.
If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.
For instance, you could load the data into a dictionary, like this:
// TextFieldParser lives in the Microsoft.VisualBasic.FileIO namespace
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            data[fields[0]] = fields;
        }
        catch (MalformedLineException ex)
        {
            // ...
        }
    }
}
And then you could get the data for any item, like this:
string[] fields = data["key I'm looking for"];
If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)
static string FindRecordBinary(string search, string fileName)
{
    using (StreamReader fs = new StreamReader(fileName))
    {
        long min = 0; // TODO: What about header row?
        long max = fs.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            fs.BaseStream.Position = mid;
            fs.DiscardBufferedData();
            if (mid != 0) fs.ReadLine();
            string line = fs.ReadLine();
            if (line == null) { min = mid + 1; continue; }
            int compareResult;
            if (line.Length > search.Length)
                compareResult = String.Compare(
                    line, 0, search, 0, search.Length, false);
            else
                compareResult = String.Compare(line, search);
            if (0 == compareResult) return line;
            else if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
    }
    return null;
}
This runs in 0.007 seconds for a 600,000-record test file that's 50 megs. In comparison, a file scan averages over half a second depending on where the record is located (a 100-fold difference).
Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.
EDIT: Added knaki02's suggested fix.
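Here is a sketch of that partial-caching idea: the same binary search, but holding one StreamReader open and resetting min and max on each call (not thread-safe as written, and the header-row TODO from above still applies):

using System;
using System.IO;

// Sketch: same binary search as above, reusing a single open StreamReader.
sealed class SortedCsvSearcher : IDisposable
{
    private readonly StreamReader _reader;

    public SortedCsvSearcher(string fileName)
    {
        _reader = new StreamReader(fileName);
    }

    public string Find(string search)
    {
        long min = 0;
        long max = _reader.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            _reader.BaseStream.Position = mid;
            _reader.DiscardBufferedData();
            if (mid != 0) _reader.ReadLine();        // skip the partial line we landed in
            string line = _reader.ReadLine();
            if (line == null) { min = mid + 1; continue; }

            int compareResult = line.Length > search.Length
                ? String.Compare(line, 0, search, 0, search.Length, false)
                : String.Compare(line, search);

            if (compareResult == 0) return line;
            if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
        return null;
    }

    public void Dispose()
    {
        _reader.Dispose();
    }
}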
Given the CSV is sorted, if you can load the entire thing into memory (and the only processing you need is a .StartsWith() on each line), you can use a binary search for exceptionally fast searching.
Maybe something like this (NOT TESTED!):
var csv = File.ReadAllLines(@"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());
...
public class StartsWithComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (x.StartsWith(y))
            return 0;
        else
            return x.CompareTo(y);
    }
}
I wrote this quickly for work, could be improved on...
Define the column numbers:
private enum CsvCols
{
    PupilReference = 0,
    PupilName = 1,
    PupilSurname = 2,
    PupilHouse = 3,
    PupilYear = 4,
}
Define the model:
public class ImportModel
{
    public string PupilReference { get; set; }
    public string PupilName { get; set; }
    public string PupilSurname { get; set; }
    public string PupilHouse { get; set; }
    public string PupilYear { get; set; }
}
Import and populate a list of models:
var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();

var pupils = rows.Select(x => new ImportModel
{
    PupilReference = x[(int)CsvCols.PupilReference],
    PupilName = x[(int)CsvCols.PupilName],
    PupilSurname = x[(int)CsvCols.PupilSurname],
    PupilHouse = x[(int)CsvCols.PupilHouse],
    PupilYear = x[(int)CsvCols.PupilYear],
}).ToList();
This returns a list of strongly typed objects.
If your file is in memory (for example because you did sorting) and you keep it as an array of strings (lines) then you can use a simple bisection search method. You can start with the code on this question on CodeReview, just change the comparer to work with string instead of int and to check only the beginning of each line.
If you have to re-read the file each time because it may be changed or it's saved/sorted by another program then the most simple algorithm is the best one:
using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains the line with the required input
            break;
        }
    }
}
Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may only need to examine a few lines) and the file itself could even be too large.
If you really need something more and you do not have your file in memory because of sorting (but you should profile your actual performance compared to your requirements) you have to implement a better search algorithm, for example the Boyer-Moore algorithm.
The OP stated they really just need to search based on the line.
The question is then whether to hold the lines in memory or not.
If a line is 1 KB, that's roughly 300 MB of memory.
If a line is 1 MB, that's roughly 300 GB of memory.
StreamReader.ReadLine will have a low memory profile.
Since the file is sorted, you can stop looking once the current line compares greater than your search term.
If you hold it in memory, then a simple
List<string>
with LINQ will work.
LINQ is not smart enough to take advantage of the sort, but against 300K lines it would still be pretty fast.
BinarySearch will take advantage of the sort.
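A sketch of the streaming option with that early exit: read line by line and stop as soon as the sort order says the prefix can no longer appear (string.CompareOrdinal assumes the file was sorted ordinally; use whatever comparison matches how it was actually sorted):

using System;
using System.IO;

// Sketch: low-memory scan over a sorted CSV that stops early.
static bool AnyLineStartsWith(string path, string prefix)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith(prefix, StringComparison.Ordinal))
                return true;

            // File is sorted: once a line compares greater than the prefix
            // (and doesn't start with it), no later line can start with it either.
            if (string.CompareOrdinal(line, prefix) > 0)
                return false;
        }
    }
    return false;
}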
Try the free CSV Reader. No need to reinvent the wheel over and over again ;)
1) If you do not need to store the results, just iterate through the CSV: handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course).
2) Try generic extension methods like this:
var list = new List<string>() { "a", "b", "c" };

string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
Here is my VB.NET code. It is for a quote-qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})
Dim path As String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P In pat _
           Let n = P.Split(New Char() {""","""}) _
           Order By n(5) _
           Select New With {
               .Doc = n(1), _
               .Loc = n(3), _
               .Chart = n(5), _
               .PatientID = n(31), _
               .Title = n(13), _
               .FirstName = n(9), _
               .MiddleName = n(11), _
               .LastName = n(7),
               .StatusID = n(41) _
           }
Patz.Dump
Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:
I need to determine for a given input, whether any line in the csv begins with that input.
That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code to simply match text against text, and you can do that via a string comparison as easily as anything else.
Additionally, you mention that the data is sorted. This should allow you to speed things up tremendously... but you need to be aware that to take advantage of this you will need to write your own code to make seek calls on low-level file streams. This will be by far your best-performing option, but it will also by far require the most initial work and maintenance.
I recommend an engineering based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.
If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.
Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.
Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.
