improve performance of a nested loop - c#

I have simplified my program for this example, so I basically load in a file and add the values from the file into a list.
IList<string> MyList = new List<string>();
void Main()
{
    foreach (Row r in InputFile)
    {
        foreach (Cell c in r)
        {
            AddToTheList(c.Value);
        }
    }
}
public void AddToTheList(string value)
{
    MyList.Add(value);
}
I am looking to speed up the processing of the loop. I do not care about the order in which the values are added.
I am thinking about running the loops in parallel and/or treating the AddToTheList method as an asynchronous fire-and-forget call.
What is the simplest way to make the code use the server's processing power and reduce the total time to process the file?

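Update: If the inner loop is heavy enough to make this task CPU-bound (rather than IO-bound), then you could partition the loop using Parallel.ForEach. Note that List<T> is not thread-safe, so concurrent calls to Add must be synchronized (or MyList replaced with a concurrent collection). Here's an example:
Parallel.ForEach(InputFile, row =>
{
    foreach (Cell c in row)
    {
        // List<T>.Add is not thread-safe; serialize access to the shared list
        lock (MyList)
            AddToTheList(c.Value);
    }
});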
Or, change the AddToTheList signature to return the value you need, and use PLINQ instead.
MyList = InputFile.AsParallel()
    .SelectMany(row => row.AsParallel()
        .Select(cell => TransformCell(cell.Value)))
    .ToList();
public string TransformCell(string value)
{
    return value + " something";
}
Making AddToTheList a fire-and-forget async method is almost certainly not a good option. Exceptions thrown by that method would go unhandled, and depending on which framework you're using, these may crash the application.
Parallelizing the calls to AddToTheList is no good - this task is IO-bound.
The bottleneck is in how fast you can read data from disk.
Parallelizing disk access would be no good either. Having two or more threads reading the same file won't be any faster - they'll have to take turns anyway. See this answer to "Is it possible to use threads to speed up file reading?":
Use as many threads as you have files.

It depends. If parsing rows and cells and adding values to the list is simple, doing things in parallel will not help you - you will be limited by I/O, which is a lot slower than the CPU.
However, if parsing the rows takes time, and you're not really adding to a List but rather doing something more complicated, you can read rows from the files, and then handle the rows in parallel - just preallocate the memory for them (List lets you do that) and access each row's List positions in parallel.
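A minimal sketch of that idea, assuming the rows have already been read sequentially into memory. ParseRow is a hypothetical stand-in for the expensive per-row work. An array is used for the results rather than a List<T>, because setting a list's capacity does not make its indexes assignable until elements have been added.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class RowProcessor
{
    // Hypothetical stand-in for expensive, CPU-bound per-row parsing.
    public static string ParseRow(string row) => row.Trim().ToUpperInvariant();

    public static string[] ProcessRows(IReadOnlyList<string> rows)
    {
        // Preallocate one slot per row so threads never resize a shared collection.
        var results = new string[rows.Count];

        // Each iteration writes only to its own index, so no lock is required.
        Parallel.For(0, rows.Count, i =>
        {
            results[i] = ParseRow(rows[i]);
        });

        return results;
    }
}
```

Because every result lands at the index of its source row, the output keeps the input order even though the rows are processed in parallel.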

Related

How to safely iterate over an IAsyncEnumerable to send a collection downstream for message processing in batches

I've watched the chat on LINQ with IAsyncEnumerable which has given me some insight on dealing with extension methods for IAsyncEnumerables, but wasn't detailed enough frankly for a real-world application, especially for my experience level, and I understand that samples/documentation don't really exist as of yet for IAsyncEnumerables
I'm trying to read from a file, do some transformation on the stream, returning a IAsyncEnumerable, and then send those objects downstream after an arbitrary number of objects have been obtained, like:
await foreach (var data in ProcessBlob(downloadedFile))
{
    //todo add data to List<T> called listWithPreConfiguredNumberOfElements
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements);
    //repeat the behaviour till all the elements in the IAsyncEnumerable returned by ProcessBlob are sent downstream to the _messageHandler.
}
My understanding from reading on the matter so far is that the await foreach line is working on data that employs the use of Tasks (or ValueTasks), so we don't have a count up front. I'm also hesitant to use a List variable and just do a length-check on that as sharing that data across threads doesn't seem very thread-safe.
I'm using the System.Linq.Async package in the hopes that I could use a relevant extensions method. I can see some promise in the form of TakeWhile, but my understanding on how thread-safe the task I intend to do is not all there, causing me to lose confidence.
Any help or push in the right direction would be massively appreciated, thank you.
There is an operator Buffer that does what you want, in the package System.Interactive.Async.
// Projects each element of an async-enumerable sequence into consecutive
// non-overlapping buffers which are produced based on element count information.
public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
this IAsyncEnumerable<TSource> source, int count);
This package contains operators like Amb, Throw, Catch, Defer, Finally etc that do not have a direct equivalent in Linq, but they do have an equivalent in System.Reactive. This is because IAsyncEnumerables are conceptually closer to IObservables than to IEnumerables (because both have a time dimension, while IEnumerables are timeless).
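If taking a dependency on System.Interactive.Async is not an option, the same idea can be sketched by hand. This is a minimal sketch and not the library's implementation: the name BatchAsync is hypothetical, it assumes C# 8 or later, and it assumes size is positive.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class AsyncBatchExtensions
{
    // Hypothetical helper: groups an async sequence into consecutive,
    // non-overlapping lists of at most 'size' elements.
    public static async IAsyncEnumerable<IList<T>> BatchAsync<T>(
        this IAsyncEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        await foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                // Start a fresh list so the consumer can keep the previous one.
                batch = new List<T>(size);
            }
        }

        // Emit whatever is left over as a final, shorter batch.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

With a helper like this, the loop from the question reduces to iterating ProcessBlob(downloadedFile).BatchAsync(preConfiguredNumber) with await foreach and awaiting _messageHandler.Handle(batch) for each batch.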
"I'm also hesitant to use a List variable and just do a length-check on that as sharing that data across threads doesn't seem very thread-safe."
You need to think in terms of execution flows, not threads, when dealing with async; since you are await-ing the processing step, there isn't actually a concurrency problem accessing the list, because regardless of which threads are used: the list is only accessed once at a time.
If you are still concerned, you could new a list per batch, but that is probably overkill. What you do need, however, is two additions - a reset between batches, and a final processing step:
var listWithPreConfiguredNumberOfElements = new List<YourType>(preConfiguredNumber);
await foreach (var data in ProcessBlob(downloadedFile)) // CAF?
{
    listWithPreConfiguredNumberOfElements.Add(data);
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
    {
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
        listWithPreConfiguredNumberOfElements.Clear(); // reset for a new batch
        // (replace this with a "new" if you're still concerned about concurrency)
    }
}
if (listWithPreConfiguredNumberOfElements.Any())
{   // process any stragglers
    await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
}
You might also choose to use ConfigureAwait(false) in the three spots marked // CAF?

Why AsParallel().ForAll does not seem to fully take advantage of the cpu usage for my operations?

I have some code which attempts to execute an operation in parallel.
The code is basically
items
    .AsParallel()
    .ForAll(item =>
    {
        DoWork(item);
    });
Items is a list of 2500 things to process. DoWork() is a 100% CPU calculation with no IO at all. It takes about 1 second however it can vary to some degree.
The problem I am seeing is all cores are being used however they are barely being used. I was thinking it was something to do with the work itself so I tried code like the following and got the same results.
items
    .Batches(10)
    .AsParallel()
    .ForAll(batch =>
    {
        foreach (var item in batch)
        {
            DoWork(item);
        }
    });
I want the utilization to be somewhere near 80% however for the life of me I can't get it there. Not sure what to do.
I have tried Parallel.ForEach with no luck. Tried .WithDegreeOfParallelism(Environment.ProcessorCount * 2) with no luck.
Not sure what to try next.
Are the items IEnumerable?
If so, you may be experiencing poor partitioning. Try wrapping (or passing) items as an IList<T> or just List<T>.
Are the items constructed/computed on the fly or lazily?
Maybe the (sequential) source of the items can not deliver them fast enough to make parallelism profitable.
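One way to act on both points, sketched under the assumption that the items fit in memory: materialize the source into a list first, then hand PLINQ a load-balancing partitioner so that uneven work items are less likely to leave cores idle. DoWork here is a hypothetical placeholder for the CPU-bound calculation from the question, and whether this helps in a given case needs to be measured.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class ParallelWork
{
    // Hypothetical placeholder for the CPU-bound calculation.
    public static double DoWork(int item)
    {
        double acc = 0;
        for (int i = 1; i <= 1000; i++)
            acc += Math.Sqrt(item + i);
        return acc;
    }

    // Runs DoWork over all items in parallel and returns how many were processed.
    public static int ProcessAll(IEnumerable<int> items)
    {
        // Materialize once, sequentially, so a lazy source cannot starve the workers.
        List<int> list = items.ToList();
        int processed = 0;

        // loadBalance: true lets workers pull items dynamically
        // instead of each receiving one fixed range up front.
        Partitioner.Create(list, loadBalance: true)
            .AsParallel()
            .ForAll(item =>
            {
                DoWork(item);
                Interlocked.Increment(ref processed);
            });

        return processed;
    }
}
```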

Does foreach loop work more slowly when used with a not stored list or array?

I am wondering whether a foreach loop runs more slowly when it iterates over an unstored list or array, that is, one produced inline rather than assigned to a variable first.
I mean like this:
foreach (int number in list.OrderBy(x => x.Value))
{
    // DoSomething();
}
Does the loop in this code recalculate the sorting on every iteration or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
    // DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, a good first instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (Tour tour in list)
{
    // DoSomething();
}
vs this option:
foreach (Tour tour in tours.OrderBy(x => x.Value))
{
    // DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading its documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is finished when you first start iterating over the object (and that, I believe, is the crux of your question: No, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take essentially the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need to create new memory allocations for the full objects, if you were previously reading from a stream or database reader that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here, that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes the foreach statement will seem to work slower.
No your program has the same total amount of work to do so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList or ToArray. In this case you are only using it once(foreach) but it is an easy thing to miss.
Edit: Just to be clear. The as statement in the question will not work as intended but my answer assumes no .ToList() after OrderBy .
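The cost of enumerating a lazy query more than once can be made visible by counting how often the key selector runs. A small sketch, in which the counter and the class name are purely illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredSortDemo
{
    // Returns how many times the key selector ran after enumerating
    // the ordered query 'passes' times.
    public static int CountKeySelections(int passes, bool materialize)
    {
        int calls = 0;
        var source = new[] { 3, 1, 2 };

        // Deferred: nothing is sorted until the query is enumerated.
        IEnumerable<int> query = source.OrderBy(x => { calls++; return x; });

        // ToList() runs the sort once and stores the sorted result.
        if (materialize)
            query = query.ToList();

        for (int i = 0; i < passes; i++)
            foreach (var _ in query) { }

        return calls;
    }
}
```

With three elements, two passes over the unmaterialized query invoke the selector six times, because the sort is redone for each foreach. With ToList() in between, the count stays at three no matter how many passes follow.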
This line won't work as intended, because OrderBy returns an IOrderedEnumerable<Tour>, not a List<Tour>:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, enumerating the stored list is faster, as it skips the sorting operation (the sort already ran once, inside ToList()).

Multi Threading with LINQ to SQL

I am writing a WinForms application. I am pulling data from my database, performing some actions on that data set and then plan to save it back to the database. I am using LINQ to SQL to perform the query to the database because I am only concerned with 1 table in our database so I didn't want to implement an entire ORM for this.
I have it pulling the dataset from the DB. However, the dataset is rather large. So currently what I am trying to do is separate the dataset into 4 relatively equal sized lists (List<object>).
Then I have a separate background worker to run through each of those lists, perform the action and report its progress while doing so. I have it planned to consolidate those sections into one big list once all 4 background workers have finished processing their section.
But I keep getting an error while the background workers are processing their unique list. Do the objects maintain their tie to the DataContext for the LINQ to SQL even though they have been converted to List objects? Any ideas how to fix this? I have minimal experience with multi-threading so if I am going at this completely wrong, please tell me.
Thanks guys. If you need any code snippets or any other information just ask.
Edit: Oops. I completely forgot to give the error message. In the DataContext designer.cs it gives the error An item with the same key has already been added. on the SendPropertyChanging function.
private List<MyObject> quarter1; // field, so the DoWork handler can see it
private void Setup(){
    quarter1 = _listFromDB.Take(5000).ToList();
    bgw1.RunWorkerAsync();
}
private void bgw1_DoWork(object sender, DoWorkEventArgs e){
    e.Result = functionToExecute(bgw1, quarter1);
}
private List<MyObject> functionToExecute(BackgroundWorker caller, List<MyObject> myList)
{
    int progress = 0;
    foreach (MyObject obj in myList)
    {
        string newString = createString();
        obj.strText = newString;
        //report progress here
        caller.ReportProgress(progress++);
    }
    return myList;
}
This same function is called by all four workers and is given a different list for myList based on which worker called the function.
Because a real answer has yet to be posted, I'll give it a shot.
Given that you haven't shown any LINQ-to-SQL code (no usage of DataContext) - I'll take an educated guess that the DataContext is shared between the threads, for example:
using (MyDataContext context = new MyDataContext())
{
    // this is just some random query that has not been materialized with ToList(),
    // thus query execution is deferred. listFromDB = IQueryable<>
    var listFromDB = context.SomeTable.Where(st => st.Something == true);
    System.Threading.Tasks.Task.Factory.StartNew(() =>
    {
        var list1 = listFromDB.Take(5000).ToList(); // runs the SQL query
        // call some function on list1
    });
    System.Threading.Tasks.Task.Factory.StartNew(() =>
    {
        var list2 = listFromDB.Take(5000).ToList(); // runs the SQL query
        // call some function on list2
    });
}
Now the error you got - An item with the same key has already been added. - was because the DataContext object is not thread-safe! A lot of stuff happens in the background - DataContext has to load objects from SQL, track their states, etc. This background work is what throws the error (because each thread is running the query, the DataContext gets accessed).
At least this is my own personal experience, having come across the same error while sharing the DataContext between multiple threads. You have two main options in this scenario:
1) Before starting the threads, call .ToList() on the query, making listFromDB not an IQueryable<>, but an actual List<>. This means that the query has already run and the threads operate on an actual List, not on the DataContext.
2) Move the DataContext definition into each thread. Because the DataContext is no longer shared, no more errors.
A third option would be to rewrite the scenario into something else, like you did (for example, make everything sequential on a single background thread)...
First of all, I don't really see why you'd need multiple worker threads at all. (Are these lists in separate databases / tables / servers? Do you really want to show 4 progress bars if you have 4 lists, or are you somehow merging these progress reports into one weird progress bar? :D)
Also, you're trying to speed up processing updates to your database, but you never tell LINQ to SQL to save, so you're not really batching transactions; you'll just save everything at the end in one big transaction. Is that really what you're aiming for? The progress bar will just stop at 100% and then spend a lot of time on the SQL side.
Just create one background thread and process everything synchronously, but batch a save transaction every couple of rows (I'd suggest something like every 1000 rows, but you should experiment with this). It'll be fast, even with millions of rows.
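The batching suggested above might look roughly like this. It is only a sketch: BatchedSaver and its parameters are hypothetical names, and the commit delegate stands in for whatever persists pending changes (with LINQ to SQL that would typically be a call to the DataContext's SubmitChanges()).

```csharp
using System;
using System.Collections.Generic;

public static class BatchedSaver
{
    // Applies 'update' to every row and calls 'commit' after each full batch,
    // plus once more for any leftover rows. Returns the number of commits made.
    public static int ProcessInBatches<T>(
        IEnumerable<T> rows, Action<T> update, Action commit, int batchSize)
    {
        int pending = 0;
        int commits = 0;

        foreach (var row in rows)
        {
            update(row);
            pending++;

            if (pending == batchSize)
            {
                commit();      // flush this batch
                commits++;
                pending = 0;
            }
        }

        // Flush the final, partial batch if there is one.
        if (pending > 0)
        {
            commit();
            commits++;
        }

        return commits;
    }
}
```

Progress can then be reported after each commit, so the progress bar reflects work that has actually been saved.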
If you really need this multithreaded solution:
The "another blabla with the same key has been added" error suggests that you are adding the same item to multiple "mylists", or adding the same item to the same list twice, otherwise how would there be any errors at all?
Using Parallel LINQ (PLINQ), you can take advantage of multiple CPU cores for processing your data. But if your application is going to run on a single-core CPU, then splitting the data into pieces won't give you any performance benefit; instead, it will incur some context-switching overhead.
Hope it Helps

Threading out inside of loop to improve performance

foreach (int tranQuote in transactionIds)
{
    CompassIntegration compass = new CompassIntegration();
    Chatham.Business.Objects.Transaction tran = compass.GetTransaction(tranQuote);
    // then we want to send each trade through the PandaIntegration
    // class with either buildSchedule, fillRates, calcPayments or
    // a subset of the three
    PandaIntegrationOperationsWrapper wrapper = new PandaIntegrationOperationsWrapper { buildSchedule = false, calcPayments = true, fillRates = true };
    new PandaIntegration().RecalculateSchedule(tran, wrapper);
    // then we call to save the transaction through the BO
    compass.SaveTransaction(tran);
}
Two lines here are taking a very long time. There's about 18k records in transactionIds that I do this for.
The GetTransaction and SaveTransaction are the two lines that take the most time, but I'd honestly just like to thread out what happens inside the loop to improve performance.
What's the best way to thread this out without running into any issues with the CPU or anything like that? I'm not really sure how many threads are safe or how to thread manage or stuff like that.
Thanks guys.
The TPL will provide the necessary throttling and managing.
//foreach (int tranQuote in transactionIds) { ... }
Parallel.ForEach(transactionIds, tranQuote => { ... } );
It does require .NET Framework 4 or later, and all the code inside the loop has to be thread-safe.
It's not clear if your GetTransaction and SaveTransaction are safe to be called concurrently.
What do GetTransaction and SaveTransaction actually do that makes them slow? If it's work on the CPU then threading could certainly help. If the slowness comes from making database queries or doing file I/O, threading isn't going to do anything to make your disk or database faster. It might actually slow things down because now you have multiple simultaneous requests going to a resource that is already constrained.
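If the slow calls do turn out to hit a shared database, one option is to cap how many run at once rather than leaving it entirely to the scheduler. A sketch, assuming the work inside the loop is thread-safe: processTransaction is a hypothetical stand-in for the get, recalculate and save sequence, and the cap is a number to tune by measurement.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ThrottledLoop
{
    // Processes every id with at most 'maxParallel' concurrent workers
    // and returns the ids whose processing threw an exception.
    public static int[] Run(
        IEnumerable<int> transactionIds, Action<int> processTransaction, int maxParallel)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        var failed = new ConcurrentBag<int>(); // thread-safe collection for failures

        Parallel.ForEach(transactionIds, options, id =>
        {
            try
            {
                processTransaction(id);
            }
            catch (Exception)
            {
                // Record the failure and keep going, so one bad record
                // does not abort the remaining ones.
                failed.Add(id);
            }
        });

        return failed.ToArray();
    }
}
```

Starting with a small cap and raising it while watching total run time is one way to find out whether the database or the CPU is the real limit.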
You can use Parallel.ForEach to improve the overall performance if order doesn't matter.
