I've been working on a quite complex C# VSTO project that does a lot of different things in Excel. However, I have recently stumbled upon a problem I have no idea how to solve. I'm afraid that putting the whole project here will overcomplicate my question and confuse everyone so this is the part with the problem:
//this is a simplified version of Range declaration which I am 100% confident in
Range range = worksheet.Range[firstCell, lastCell];
range.Formula = array;
//where array is an object[,] which basically contains only strings and also works perfectly fine
The last line, which is supposed to insert a [,] array into an Excel range, used to work for smaller Excel workbooks, but now crashes for bigger ones with a System.OutOfMemoryException: Insufficient memory to continue the execution of the program, and I have no idea why: it used to work with arrays of 500+ elements along one dimension, whereas now it crashes for an array with under 400 elements. Furthermore, RAM usage is about 1.2 GB at the moment of the crash, and I know this project is capable of running perfectly fine at ~3 GB of RAM usage.
I have tried the following: inserting the array row by row, then inserting it cell by cell, and calling GC.Collect() before each insertion of a row or a cell, but it would nonetheless crash with a System.OutOfMemoryException.
So I would appreciate any help in solving this problem or identifying where the error could be hiding, because I can't wrap my head around why it refuses to work for shorter arrays (though maybe with slightly bigger contents) at a RAM usage of 1.2 GB, about a third of what it used to handle. Thank you!
EDIT
I've been told in the comments that the code above might be too sparse, so here is a more detailed version (I hope it's not too confusing):
List<object[][]> controlsList = new List<object[][]>();
// this list is filled with a quite long method calling a lot of other functions
// if other parts look fine, I guess I'll have to investigate it
int totalRows = 1;
foreach (var control in controlsList)
{
if (control.Length == 0)
continue;
var range = worksheet.GetRange(totalRows + 1, 1, totalRows += control.Length, 11);
//control is an object[n][11] so normally there are no index issues with inserting
range.Formula = control.To2dArray();
}
//GetRange and To2dArray are extension methods
public static Range GetRange(this Worksheet sheet, int firstRow, int firstColumn, int lastRow, int lastColumn)
{
var firstCell = sheet.GetRange(firstRow, firstColumn);
var lastCell = sheet.GetRange(lastRow, lastColumn);
return (Range)sheet.Range[firstCell, lastCell];
}
public static Range GetRange(this Worksheet sheet, int row, int col) => (Range)sheet.CheckIsPositive(row, col).Cells[row, col];
public static T CheckIsPositive<T>(this T returnedValue, params int[] vals)
{
if (vals.Any(x => x <= 0))
throw new ArgumentException("Values must be positive");
return returnedValue;
}
public static T[,] To2dArray<T>(this T[][] source)
{
if (source == null)
throw new ArgumentNullException();
int l1 = source.Length;
int l2 = source[0].Length;
T[,] result = new T[l1, l2];
for (int i = 0; i < l1; ++i)
for (int j = 0; j < l2; ++j)
result[i, j] = source[i][j];
return result;
}
I am not 100% sure I figured it out correctly, but the issue seems to lie in Interop.Excel/Excel limitations on formula length: whenever a formula's length approaches 8,192 characters, which is Excel's limit for formula contents, the System.OutOfMemoryException pops up. When I opted to leave the lengthy formulas out, the program started working fine.
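For anyone hitting the same issue, here is a minimal sketch of that workaround. The 8,192-character threshold is Excel's documented formula-length limit, but the skip-and-placeholder strategy is just one way to handle it, not the exact code from my project:
// Guard against formulas longer than Excel's 8,192-character limit
// before assigning the array to Range.Formula.
const int maxFormulaLength = 8192;
for (int i = 0; i < array.GetLength(0); i++)
{
    for (int j = 0; j < array.GetLength(1); j++)
    {
        var s = array[i, j] as string;
        if (s != null && s.Length >= maxFormulaLength)
        {
            array[i, j] = string.Empty; // too long for Excel; leave the cell blank
        }
    }
}
range.Formula = array;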
I want to know the most efficient way of replacing empty strings in an array with null values.
I have the following array:
string[] _array = new string [10];
_array[0] = "A";
_array[1] = "B";
_array[2] = "";
_array[3] = "D";
_array[4] = "E";
_array[5] = "F";
_array[6] = "G";
_array[7] = "";
_array[8] = "";
_array[9] = "J";
and I am currently replacing empty strings by the following:
for (int i = 0; i < _array.Length; i++)
{
if (_array[i].Trim() == "")
{
_array[i] = null;
}
}
which works fine on small arrays, but I'm after the most efficient code for the task, because the arrays I am working with could be much larger and I would be repeating this process over and over again.
Is there a LINQ query or something that is more efficient?
You might consider switching _array[i].Trim() == "" to string.IsNullOrWhiteSpace(_array[i]) to avoid the string allocation from Trim(). But that's pretty much all you can do to make it faster while keeping it sequential. LINQ will not be faster than a for loop.
You could try making your processing parallel, but that seems like a bigger change, so you should evaluate if that's ok in your scenario.
Parallel.For(0, _array.Length, i => {
if (string.IsNullOrWhiteSpace(_array[i]))
{
_array[i] = null;
}
});
As far as efficiency goes it is fine, but it also depends on how large the array is and how frequently you iterate over such arrays. The main problem I see is that you could get a NullReferenceException with your Trim approach. A better approach is to use string.IsNullOrEmpty or string.IsNullOrWhiteSpace; the latter is more along the lines of what you want but is not available in all versions of .NET.
for (int i = 0; i < _array.Length; i++)
{
if (string.IsNullOrWhiteSpace(_array[i]))
{
_array[i] = null;
}
}
LINQ is mainly used for querying, not for assignment. Note that even with a List, List<T>.ForEach won't help here, because assigning to the lambda parameter doesn't replace the element in the list. If you use a List instead of an array, you could instead do it in one line with Select:
_list = _list.Select(x => string.IsNullOrWhiteSpace(x) ? null : x).ToList();
A LINQ query will do essentially the same thing behind the scenes, so you aren't going to gain any real efficiency simply by using LINQ.
When determining something more efficient, look at a few things:
How big will your array grow?
How often will the data in your array change?
Does the order of your array matter?
You've already answered that your array might grow to large sizes and performance is a concern.
So looking at options 2 and 3 together: if the order of your data doesn't matter, you could keep your array sorted and break out of the loop as soon as you detect a non-empty string (see the sketch below).
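A rough sketch of that idea, assuming order truly doesn't matter and that the default sort puts empty/whitespace strings at the front (worth verifying for your data):
Array.Sort(_array); // empty and whitespace-only strings sort before other text
for (int i = 0; i < _array.Length; i++)
{
    if (!string.IsNullOrWhiteSpace(_array[i]))
        break; // everything from here on is non-empty
    _array[i] = null;
}
Keep in mind the sort itself is O(n log n), so this only pays off if the array is kept sorted anyway.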
Ideally, you would be able to check the data on the way in so you don't have to constantly loop over your entire array. Is that not a possibility?
Hope this at least gets some thoughts going.
It's ugly, but you can eliminate the CALL instruction to the RTL, as I mentioned earlier, with this code:
if (_array[i] != null) {
Boolean blank = true;
for(int j = 0; j < _array[i].Length; j++) {
if(!Char.IsWhiteSpace(_array[i][j])) {
blank = false;
break;
}
}
if (blank) {
_array[i] = null;
}
}
But it does add an extra assignment and an extra condition, and it is just too ugly for me. But if you want to shave nanoseconds off a massive list then perhaps it could be used. I like the idea of parallel processing, and you could wrap this with Parallel, as sketched below.
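A sketch of that combination, wrapping the manual whitespace scan in Parallel.For; since each iteration writes only its own index, no locking is needed:
Parallel.For(0, _array.Length, i =>
{
    string s = _array[i];
    if (s == null) return;
    bool blank = true;
    for (int j = 0; j < s.Length; j++)
    {
        if (!char.IsWhiteSpace(s[j])) { blank = false; break; }
    }
    if (blank)
        _array[i] = null; // each thread writes a distinct index, so this is safe
});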
You can also do it with Select, but note that checking str.Length == 0 throws a NullReferenceException on null entries and misses whitespace-only strings, so use string.IsNullOrWhiteSpace (this also allocates a new array):
_array = _array.Select(str => string.IsNullOrWhiteSpace(str) ? null : str).ToArray();
This is a very weird situation; first, the code...
The code
private List<DispatchInvoiceCTNDataModel> WorksheetToDataTableForInvoiceCTN(ExcelWorksheet excelWorksheet, int month, int year)
{
int totalRows = excelWorksheet.Dimension.End.Row;
int totalCols = excelWorksheet.Dimension.End.Column;
DataTable dt = new DataTable(excelWorksheet.Name);
// for (int i = 1; i <= totalRows; i++)
Parallel.For(1, totalRows + 1, (i) =>
{
DataRow dr = null;
if (i > 1)
{
dr = dt.Rows.Add();
}
for (int j = 1; j <= totalCols; j++)
{
if (i == 1)
{
var colName = excelWorksheet.Cells[i, j].Value.ToString().Replace(" ", String.Empty);
lock (lockObject)
{
if (!dt.Columns.Contains(colName))
dt.Columns.Add(colName);
}
}
else
{
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
}
}
});
var excelDataModel = dt.ToList<DispatchInvoiceCTNDataModel>();
// now we have mapped everything except for the IDs
excelDataModel = MapInvoiceCTNIDs(excelDataModel, month, year, excelWorksheet);
return excelDataModel;
}
The problem
When I run the code, on random occasions it throws an IndexOutOfRangeException on the line
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
for some random values of i and j. When I step over the code (F10), since it is running in a parallel loop, some other thread kicks in and another exception is thrown. That other exception was something like (I could not reproduce it, it only came once, but I think it is also related to this threading issue) Column 31 not found in excelWorksheet. I don't understand how any of these exceptions could occur.
case 1
The IndexOutOfRangeException should not even occur: the only shared variable, dt, has a lock around access to it, and everything else is either local or a parameter, so there should not be any thread-related issue. Also, if I check the value of i or j in the debug window, or even evaluate the whole expression dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null; or a part of it in the Debug window, it works just fine, with no errors of any sort.
case 2
For the second error (which unfortunately is not reproducing now, but still), it should not have occurred, as there are 33 columns in the Excel sheet.
More Code
In case someone needs it, here is how this method is called:
using (var xlPackage = new ExcelPackage(viewModel.postedFile.InputStream))
{
ExcelWorksheets worksheets = xlPackage.Workbook.Worksheets;
// other stuff
var entities = this.WorksheetToDataTableForInvoiceCTN(worksheets[1], viewModel.Month, viewModel.Year);
// other stuff
}
Other
If someone needs more code/details let me know.
Update
Okay, to answer some comments: it works fine when using a plain for loop; I have tested that many times. Also, there is no particular value of i or j for which the exception is thrown. Sometimes it is 8, 6; at other times it could be anything, say 19, 2. Also, the +1 in the Parallel loop is not doing any damage, as the MSDN documentation says the upper bound is exclusive, not inclusive. And if that were the issue, I would only be getting the exception at the last index (the last value of i), but that's not the case.
UPDATE 2
The given answer suggested locking around the code
dr = dt.Rows.Add();
I have changed it to
lock(lockObject) {
dr = dt.Rows.Add();
}
It is not working. Now I am getting an ArgumentOutOfRangeException; still, if I run the same code in the debug window, it works just fine.
Update 3
Here is the full exception detail, after update 2 (I am getting this on the line that I mentioned in update 2)
System.ArgumentOutOfRangeException was unhandled by user code
HResult=-2146233086
Message=Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index
Source=mscorlib
ParamName=index
StackTrace:
at System.ThrowHelper.ThrowArgumentOutOfRangeException()
at System.Collections.Generic.List`1.get_Item(Int32 index)
at System.Data.RecordManager.NewRecordBase()
at System.Data.DataTable.NewRecordFromArray(Object[] value)
at System.Data.DataRowCollection.Add(Object[] values)
at AdminEntity.BAL.Service.ExcelImportServices.<>c__DisplayClass2e.<WorksheetToDataTableForInvoiceCTN>b__2d(Int32 i) in C:\Projects\Manager\Admin\AdminEntity\AdminEntity.BAL\Service\ExcelImportServices.cs:line 578
at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
InnerException:
Okay. So there are a few problems with your existing code, most of which have been touched on by others:
Parallel threads are at the mercy of the OS scheduler; therefore, although threads are queued in-order, they may (and often do) complete execution out-of-order. For example, given Parallel.For(0, 10, (i) => { Console.WriteLine(i); });, the first four threads (on a quad-core system) will be queued with i values 0-3. But any one of those threads may start or finish executing before any other. So you may see 2 printed first, whereupon thread 4 will be queued. Then thread 1 might complete, and thread 5 will be queued. Then thread 4 might complete, even before threads 0 or 3 do. Etc., etc. TL;DR: You CANNOT assume an ordered output in parallel.
Given that, as @ScottChamberlain noted, it's a very bad idea to do column generation within your parallel loop, because you have no guarantee that the thread doing column generation will create all your columns before another thread starts assigning data in rows to those column indices. E.g. you could be assigning data to cell [0,4] before your table actually has a fifth column.
It's worth noting that this should really be broken out of the loop anyway, purely from a code cleanliness perspective. At the moment, you have two nested loops, each with special behavior on a single iteration; better to separate that setup logic into its own loop and leave the main loop to assign data and nothing else.
For the same reason, you should not be creating new rows in the table within your parallel loop - because you have no guarantee that the rows will be added to the table in their source order. Break that out too, and access rows within the loop based on their index.
Some have mentioned using DataTable.NewRow() before Rows.Add(). Technically, NewRow() is the right way to go about things, but the recommended access pattern for it is a bit different from what is probably appropriate for a cell-by-cell function, particularly when parallelism is intended (see MSDN: DataTable.NewRow Method). The fact remains that adding a new, blank row to a DataTable with Rows.Add() and populating it afterwards functions properly.
You can clean up your string formatting with the null-coalescing operator ??, which returns the left operand unless it is null, in which case it returns the right operand. For example, foo = bar ?? "" is the equivalent of if (bar == null) { foo = ""; } else { foo = bar; }.
So right off the bat, your code should look more like this:
private void ReadIntoTable(ExcelWorksheet sheet)
{
DataTable dt = new DataTable(sheet.Name);
int height = sheet.Dimension.Rows;
int width = sheet.Dimension.Columns;
for (int j = 1; j <= width; j++)
{
string colText = (sheet.Cells[1, j].Value ?? "").ToString();
dt.Columns.Add(colText);
}
for (int i = 2; i <= height; i++)
{
dt.Rows.Add();
}
Parallel.For(1, height, (i) =>
{
var row = dt.Rows[i - 1];
for (int j = 0; j < width; j++)
{
string str = (sheet.Cells[i + 1, j + 1].Value ?? "").ToString();
row[j] = str;
}
});
// convert to your special Excel data model
// ...
}
Much better!
...but it still doesn't work!
Yep, it still fails with an IndexOutOfRange exception. However, since we took your original line dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null; and split it into a couple pieces, we can see exactly which part it fails on. And it fails on row[j] = str;, where we actually write the text into the row.
Uh-oh.
MSDN: DataRow Class
Thread Safety
This type is safe for multithreaded read operations. You must synchronize any write operations.
*sigh*. Yeah. Who knows why DataRow uses static anything when assigning values, but there you have it; writing to DataRow isn't thread-safe. And sure enough, doing this...
private static object s_lockObject = "";
private void ReadIntoTable(ExcelWorksheet sheet)
{
// ...
lock (s_lockObject)
{
row[j] = str;
}
// ...
}
...magically makes it work. Granted, it completely destroys the parallelism, but it works.
Well, it almost completely destroys the parallelism. Anecdotal experimentation on an Excel file with 18 columns and 46319 rows shows that the Parallel.For() loop creates its DataTable in about 3.2s on average, whereas replacing Parallel.For() with for (int i = 1; i < height; i++) takes about 3.5s. My guess is that, since the lock is only there for writing data, there is a very small benefit realized by writing data on one thread and processing text on the other(s).
Of course, if you can create your own DataTable replacement class, you can see a much larger speed boost. For example:
string[,] rows = new string[height, width];
Parallel.For(1, height, (i) =>
{
for (int j = 0; j < width; j++)
{
rows[i - 1, j] = (sheet.Cells[i + 1, j + 1].Value ?? "").ToString();
}
});
This executes in about 1.8s on average for the same Excel table mentioned above - about half the time of our barely-parallel DataTable. Replacing the Parallel.For() with the standard for() in this snippet makes it run in about 2.5s.
So you can see a significant performance boost from parallelism, but also from a custom data structure, although the viability of the latter will depend on how easily you can convert the returned values to that Excel data model thing, whatever it is.
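If it helps, a purely hypothetical sketch of that conversion step; the property names are placeholders, since the real members of DispatchInvoiceCTNDataModel aren't shown anywhere in the question:
var models = new List<DispatchInvoiceCTNDataModel>(height - 1);
for (int i = 0; i < height - 1; i++)
{
    var model = new DispatchInvoiceCTNDataModel();
    // map each column index to the matching property, e.g.:
    // model.InvoiceNumber = rows[i, 0]; // placeholder property name
    // model.Quantity = int.Parse(rows[i, 1] ?? "0"); // placeholder property name
    models.Add(model);
}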
The line dr = dt.Rows.Add(); is not thread safe; you are corrupting the internal state of the array in the DataTable that holds the rows for the table.
At first glance changing it to
if (i > 1)
{
lock (lockObject)
{
dr = dt.Rows.Add();
}
}
should fix it, but that does not mean there are no other thread-safety problems from excelWorksheet.Cells being accessed from multiple threads. (If excelWorksheet is this class and you are running an STA main thread (WinForms or WPF), COM should marshal the cross-thread calls for you.)
EDIT: New theory: the problem comes from the fact that you are setting up your schema inside the parallel loop while attempting to write to it at the same time. Pull all of the i == 1 logic out to before the loop and then start at i == 2:
private List<DispatchInvoiceCTNDataModel> WorksheetToDataTableForInvoiceCTN(ExcelWorksheet excelWorksheet, int month, int year)
{
int totalRows = excelWorksheet.Dimension.End.Row;
int totalCols = excelWorksheet.Dimension.End.Column;
DataTable dt = new DataTable(excelWorksheet.Name);
//Build the schema before we loop in parallel.
for (int j = 1; j <= totalCols; j++)
{
var colName = excelWorksheet.Cells[1, j].Value.ToString().Replace(" ", String.Empty);
if (!dt.Columns.Contains(colName))
dt.Columns.Add(colName);
}
Parallel.For(2, totalRows + 1, (i) =>
{
DataRow dr = null;
lock(lockObject) {
dr = dt.Rows.Add();
}
for (int j = 1; j <= totalCols; j++)
{
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
}
});
var excelDataModel = dt.ToList<DispatchInvoiceCTNDataModel>();
// now we have mapped everything except for the IDs
excelDataModel = MapInvoiceCTNIDs(excelDataModel, month, year, excelWorksheet);
return excelDataModel;
}
Your code is incorrect:
1) Parallel.For has its own batching mechanism (though it can be customized with ForEach and partitioners) and does not guarantee that the iteration with i == n executes after the iteration with i == m where n > m.
So the line
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
throws an exception when the required column has not yet been added (in the i == 1 iteration).
2) And it's recommended to use the NewRow method (see the sketch after this list):
dr = tbl.NewRow() -> populate dr -> tbl.Rows.Add(dr)
or Rows.Add(object[] values):
values = new object[knownColumnCount] -> populate values -> tbl.Rows.Add(values)
3) It's really better to populate the columns first in this case, because access to the Excel file is sequential (seek), and it would not harm performance.
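A minimal sketch of the two patterns from point 2, assuming the columns already exist (the values shown are placeholders):
// Pattern 1: NewRow() gives a detached row with the table's schema
DataRow dr = dt.NewRow();
dr[0] = "some value"; // populate by index or by column name
dt.Rows.Add(dr);      // attach the populated row to the table

// Pattern 2: Rows.Add(object[]) creates and populates in one call
object[] values = new object[dt.Columns.Count];
values[0] = "some value";
dt.Rows.Add(values);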
Have you tried using NewRow when creating the new DataRow, and moving the creation of the columns outside the parallel loop as Scott Chamberlain suggested above? By using NewRow you're creating a row with the same schema as the parent DataTable. I got the same error as you when I tried your code with a random Excel file, but got it to work like this:
for (int x = 1; x <= totalCols; x++)
{
var colName = excelWorksheet.Cells[1, x].Value.ToString().Replace(" ", String.Empty);
if (!dt.Columns.Contains(colName))
dt.Columns.Add(colName);
}
Parallel.For(2, totalRows + 1, (i) =>
{
    // build the complete row first, then add it to the table under the lock
    DataRow dr = dt.NewRow();
    for (int j = 1; j <= totalCols; j++)
    {
        dr[j - 1] = excelWorksheet.Cells[i, j].Value != null
            ? excelWorksheet.Cells[i, j].Value.ToString()
            : null;
    }
    lock (lockObject)
    {
        dt.Rows.Add(dr);
    }
});
I have a problem with deleting steps from a scenario in an Add-In for Enterprise Architect.
I want to delete the empty scenario steps from an element, but it's not working; the "deleted" steps still exist in the scenario.
Where is the mistake in my code?
short esCnt = element.Scenarios.Count;
for (short esIdx = (short)(esCnt - 1); esIdx >= 0; --esIdx)
{
EA.IDualScenario es = element.Scenarios.GetAt(esIdx);
short essCnt = es.Steps.Count;
for (short essIdx = (short)(essCnt - 1); essIdx >= 0; --essIdx)
{
EA.IDualScenarioStep ess = es.Steps.GetAt(essIdx);
if (ess.Name.Trim().Length == 0 &&
ess.Uses.Trim().Length == 0 &&
ess.Results.Trim().Length == 0)
{
//1. section
es.Steps.Delete(essIdx);
ess.Update();
}
}
//2. section
es.Update();
}
Do you have any ideas?
Looks like an off-by-one. The Collection index is zero-based, but the scenario step numbering in the GUI, as reflected in ScenarioStep.Pos, starts from 1. So you might in fact be deleting the wrong steps.
To be on the safe side, you shouldn't use a foreach loop when you're making changes to the collection you're looping over, but a reverse for loop:
int nrScenarios = element.Scenarios.Count;
for (int scenIx = nrScenarios - 1; scenIx >= 0; --scenIx) {
Scenario scenario = element.Scenarios.GetAt(scenIx);
int nrSteps = scenario.Steps.Count;
for (int stepIx = nrSteps - 1; stepIx >= 0; --stepIx) {
In this case, it's not as important in the outer loop as in the inner one, since that's the collection you're manipulating.
Other than that, you shouldn't need to call es.Update() at all, and element.Scenarios.Refresh() should be called outside the inner loop.
Finally, are you sure that Step.Name is actually empty? I'm unable to create steps with empty names ("Action") in the GUI, but you might have been able to do it through the API.
I think the problem lies in the Update() calls.
The EA.Collection.Delete() call will immediately delete the element from the database (but not from the collection in memory). If, however, you then call Update() on the object you just deleted, I think it will recreate it, possibly with a new sequence number; which explains why the deleted steps are now "moved" to the end.
Try removing the Update() call and see if that helps.
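Putting both answers together, a sketch of the inner loop with the Update() calls removed; the Refresh() call is an assumption based on the first answer's advice, everything else is the question's own API calls:
short essCnt = es.Steps.Count;
for (short essIdx = (short)(essCnt - 1); essIdx >= 0; --essIdx)
{
    EA.IDualScenarioStep ess = es.Steps.GetAt(essIdx);
    if (ess.Name.Trim().Length == 0 &&
        ess.Uses.Trim().Length == 0 &&
        ess.Results.Trim().Length == 0)
    {
        es.Steps.Delete(essIdx); // deletes from the database immediately
        // no ess.Update() here: it would recreate the deleted step
    }
}
es.Steps.Refresh(); // refresh the in-memory collection once, after the loop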
I'm currently coding a project that can take up to 200 entries of a specific product, as determined by user input. Basically, my GUI loads, and I use jQuery to dynamically build the entries whenever the amount field changes. With jQuery, I simply give each of them an id of the form variable1, variable2, ..., variableX (where X is the number of entries indicated). A small snippet of code to clarify:
for(var i = 1;i <= amount_selected; i++) {
$('table_name tr:last').after('<tr><td><input type="text" id="variable' + i + '"></td></tr>');
}
Now that I'm moving to the back end, I'm trying to reference these inputs by putting their names in a list. I went ahead and put them in a List of HtmlInputText, to access the variables from the list itself. (This would save having to handle all (up to 200) of them manually, which is really not an option.)
So what I did (in C#) was:
List<HtmlInputText> listvar = new List<HtmlInputText>();
for(int i = 1; i <= amount_selected; i++) {
string j = "variable" + Convert.ToString(i);
HtmlInputText x = j;
listvar.Add((x));
samplemethod(listvar[i]);
}
But it's not working at all. Does anyone have any ideas as to how this would be done, without doing so manually? I know my logic might be completely off, but hopefully this illustrates at least what I'm attempting to do.
I'm assuming these inputs are in a form? If you're submitting then you can access the text boxes from the Request object:
List<string> results = new List<string>();
for (int i = 1; i <= amount_selected; i++)
{
string s = String.Format("{0}", Request.Form["variable" + Convert.ToString(i)]);
results.Add(s);
}
On the client side, you could do $("#variable" + i).val().