I have the following TPL function:
int arrayIndex = 0;
Dictionary<string, int> customModel = new Dictionary<string, int>();
Task task = Task.Factory.StartNew(() =>
    // process each employee holiday
    Parallel.ForEach<EmployeeHolidaysModel>(holidays,
        new ParallelOptions() {
            MaxDegreeOfParallelism = System.Environment.ProcessorCount
        },
        item => {
            customModel.Add(item.HolidayName, arrayIndex);
            // increment the index
            arrayIndex++;
        })
);
// wait for all Tasks to finish
Task.WaitAll(task);
The problem is that arrayIndex won't have unique values because of the parallelism.
Is there a way I can control the arrayIndex variable so between parallel tasks the value is unique?
Basically in my customModel I can't have a duplicate arrayIndex value.
Appreciate any help.
Three problems here:
You are writing to shared variables (both the int and the dictionary). This is unsafe. You must either synchronize or use thread-safe collections.
The amount of work that you're doing per iteration is so small that the overhead of parallelism will be multiple orders of magnitude bigger. This is not a good case for parallelism. Expect major slowdowns.
You start a task, then wait for it. What did you mean to accomplish by doing that?
I think you need a basic tutorial about threading. These are very basic issues. You won't have fun with multi-threading at your current level of knowledge...
You'll need to use Interlocked.Increment(). You should probably also use ConcurrentDictionary to be safe, assuming that's not just sample-code you cooked up for the question.
Similarly, the Task isn't necessary here, since you're just waiting on it to finish filling customModel. Obviously, your scenario may be more complex.
But given the code you posted, I'd do something like:
int arrayIndex = 0;
ConcurrentDictionary<string, int> customModel
    = new ConcurrentDictionary<string, int>();

Parallel.ForEach<EmployeeHolidaysModel>(
    holidays,
    new ParallelOptions() {
        MaxDegreeOfParallelism = System.Environment.ProcessorCount
    },
    item => customModel.TryAdd(
        item.HolidayName,
        Interlocked.Increment(ref arrayIndex)
    )
);
NowYouCanDoSomethingWith(customModel);
I have 2 loops (nested), and I'm trying to do a simple parallelisation.
pseudocode:
for item1 in data1 (~100 million rows)
    for item2 in data2 (~100 rows)
        result = process(item1, item2) // couple of if conditions
        hashset.add(result) // while adding, in case of a duplicate I also decide which one to retain
To be precise, process(item1, item2) has 4 if conditions based on values in item1 and item2 (time taken is less than 50 ms).
data1 size is Nx17
data2 size is Nx17
result size is 1x17 (the result is joined into a string before it is added to the hashset)
max output size: unknown, but I would like to be ready for at least 500 million
which means the hashset would be holding 500 million items (how to handle so much data in a hashset would be another question, I guess).
Should I just use a concurrent hash set to make it thread-safe and go with Parallel.ForEach, or should I go with the Task approach?
Please provide some code samples based on your opinion.
The answer depends a lot on the cost of process(item1, item2). If it is a CPU-intensive operation, then you can surely benefit from Parallel.ForEach. Of course, you should use a concurrent dictionary or lock around your hash table. You should benchmark to see what works best for you. If process is too cheap, then you will probably gain nothing from the parallelization: the locking on the hash table will kill it all.
You should also test whether enumerating data2 in the outer loop is faster. It might give you another benefit: you can have a separate hash table for each instance of data2 and then merge the results into one hash table, which avoids the locks.
Again, you need to do your tests; there is no universal answer here.
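To illustrate the lock-avoidance idea, here is a minimal sketch that gives each worker its own private HashSet and merges them once at the end, using the Parallel.ForEach overload with thread-local state (localInit/localFinally). The arrays and the Process function below are placeholders, not the asker's real code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Demo
{
    // placeholder for the real process(item1, item2)
    static string Process(int item1, int item2) => (item1 * item2).ToString();

    static void Main()
    {
        int[] data1 = Enumerable.Range(1, 1_000_000).ToArray();
        int[] data2 = Enumerable.Range(1, 100).ToArray();

        var results = new HashSet<string>();
        object gate = new object();

        Parallel.ForEach(
            data1,
            () => new HashSet<string>(),          // localInit: a private set per worker
            (item1, state, localSet) =>
            {
                foreach (int item2 in data2)
                    localSet.Add(Process(item1, item2));
                return localSet;
            },
            localSet =>                           // localFinally: merge once per worker
            {
                lock (gate) results.UnionWith(localSet);
            });

        Console.WriteLine(results.Count);
    }
}

Each worker touches only its own set, so the lock is taken once per worker at the end rather than once per result.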
My suggestion is to separate the processing of the data from the saving of the results to the HashSet, because the first is parallelizable but the second is not. You could achieve this separation with the producer-consumer pattern, using a BlockingCollection and threads (or tasks). But I'll show a solution using a more specialized tool, the TPL Dataflow library. I'll assume that the data are two arrays of integers, and the processing function can produce up to 500,000,000 different results:
var data1 = Enumerable.Range(1, 100_000_000).ToArray();
var data2 = Enumerable.Range(1, 100).ToArray();
static int Process(int item1, int item2)
{
return unchecked(item1 * item2) % 500_000_000;
}
The dataflow pipeline will have two blocks. The first block is a TransformBlock that accepts an item from the data1 array, processes it with all items of the data2 array, and returns a batch of the results (as an int array).
var processBlock = new TransformBlock<int, int[]>(item1 =>
{
int[] batch = new int[data2.Length];
for (int j = 0; j < data2.Length; j++)
{
batch[j] = Process(item1, data2[j]);
}
return batch;
}, new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 100,
MaxDegreeOfParallelism = 3 // Configurable
});
The second block is an ActionBlock that receives the processed batches from the first block and adds the individual results to the HashSet.
var results = new HashSet<int>();
var saveBlock = new ActionBlock<int[]>(batch =>
{
for (int i = 0; i < batch.Length; i++)
{
results.Add(batch[i]);
}
}, new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 100,
MaxDegreeOfParallelism = 1 // Mandatory
});
The line below links the two blocks together, so that the data will flow automatically from the first block to the second:
processBlock.LinkTo(saveBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
The last step is to feed the first block with the items of the data1 array, and wait for the completion of the whole operation.
for (int i = 0; i < data1.Length; i++)
{
processBlock.SendAsync(data1[i]).Wait();
}
processBlock.Complete();
saveBlock.Completion.Wait();
The HashSet now contains the results.
A note about the BoundedCapacity option: it controls the flow of the data, so that a fast upstream block will not flood a slow downstream block with data. Configuring this option properly increases the memory and CPU efficiency of the pipeline.
The TPL Dataflow library is built into .NET Core and is available as a package for the .NET Framework.
Assume I have a producer-consumer pattern where the consumer can also produce additional work. Essentially, imagine a list with 1000 integers:
var LL = new List<int> {1, 2, 3, ....., 1000};
I want to multi-thread the sum - so I am taking 2 numbers at a time, summing them, and adding the result back to LL. I would do this until there is only 1 entry left in LL, when the last outstanding thread returns.
My experimental code looks like this:
var LL = Enumerable.Range(1, 1000).ToList();
Func<int, int, int> sum = (a, b) => { return a + b; };
object o = new object();
int outstandingThreads = 0;
while (LL.Count > 1 || outstandingThreads > 0)
{
// Note that I set an upper bound of 8 simultaneous threads
if (LL.Count > 1 && outstandingThreads < 8)
{
var l1 = LL[0];
LL.RemoveAt(0);
var l2 = LL[0];
LL.RemoveAt(0);
Interlocked.Increment(ref outstandingThreads);
var t = Task.Factory.StartNew(() =>
{
var rr = l1 + l2;
// In practice I would use a ConcurrentBag and not explicitly lock
lock (o)
{
LL.Add(rr);
}
Interlocked.Decrement(ref outstandingThreads);
}, CancellationToken.None,
TaskCreationOptions.DenyChildAttach,
TaskScheduler.Default);
}
}
I'm scratching my head, as this is not working. I get a different result almost every time, so I must be hitting a race condition that I cannot see. Please note that processing a List is not my actual use case, just a simplification. If there's a better pattern I could be using, I'm also all ears. Multithreading, as you can see, is not my forte.
You've got a lock around Add, but RemoveAt is also a modification of the list.
Why no lock around that?
A race may happen between .Add from a worker thread and .RemoveAt from the main thread, and it could corrupt the .Count property that the List caches (calculating .Count by walking the whole list would be overkill, so the List certainly caches it). Both Add and RemoveAt do two things: modify the list items and update .Count. Even if it doesn't crash, the list may get messed up, so yeah, I think that's it.
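For illustration, here is a sketch of the experiment with every access to LL (the Count checks, the two RemoveAt calls, and the Add) going through the same lock; this is one way the race could be closed, not necessarily the only one:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Demo
{
    static void Main()
    {
        var LL = Enumerable.Range(1, 1000).ToList();
        object o = new object();
        int outstandingThreads = 0;

        while (true)
        {
            int l1, l2;
            lock (o) // one lock guards Count, RemoveAt and Add alike
            {
                // done: a single value left and no workers still running
                if (LL.Count <= 1 && outstandingThreads == 0) break;
                // nothing to pair up yet, or the 8-worker cap reached:
                // spin again (busy-waits like the original; fine for a demo)
                if (LL.Count < 2 || outstandingThreads >= 8) continue;
                l1 = LL[0]; LL.RemoveAt(0);
                l2 = LL[0]; LL.RemoveAt(0);
            }
            Interlocked.Increment(ref outstandingThreads);
            Task.Run(() =>
            {
                int rr = l1 + l2;
                lock (o) { LL.Add(rr); } // Add under the same lock
                Interlocked.Decrement(ref outstandingThreads);
            });
        }
        Console.WriteLine(LL[0]); // prints 500500
    }
}

With the shared list consistently guarded, the result is stable (500500 for 1..1000).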
I have a List to loop over using multiple threads: I will get the first item of the List, do some processing, then remove the item.
When the count of the List is not greater than 0, fetch more data from the database.
In a word:
I have a lot of records in my database. I need to publish them to my server. In the process of publishing, multithreading is required, and the number of threads may be 10 or less.
For example:
private List<string> list;

void LoadDataFromDatabase()
{
    list = ...; // load data from database...
}

void DoMethod()
{
    while (list.Count > 0)
    {
        var item = list.FirstOrDefault();
        list.RemoveAt(0);
        DoProcess(); // how to use multi-threading here (with a custom count of threads)?
        if (list.Count <= 0)
        {
            LoadDataFromDatabase();
        }
    }
}
Please help me. I'm a beginner at C#; I have searched a lot of solutions, but found nothing similar.
One more thing: I need to be able to customize the count of threads.
Does your processing of the list need to be sequential? In other words, is it impossible to process element n + 1 while element n has not finished processing yet? If that is your case, then multi-threading is not the right solution.
Otherwise, if your processing elements are fully independent, you can use m threads, dividing Elements.Count / m elements to each thread to work on.
Example: printing a list:
List<int> a = new List<int> { 1, 2, 3, 4,5 , 6, 7, 8, 9 , 10 };
int num_threads = 2;
int thread_elements = a.Count / num_threads;
// start the threads
Thread[] threads = new Thread[num_threads];
for (int i = 0; i < num_threads; ++i)
{
threads[i] = new Thread(new ParameterizedThreadStart(Work)); // Work takes an argument, so ParameterizedThreadStart is needed
threads[i].Start(i);
}
// this works fine if the total number of elements is divisible by num_threads
// but if we have 500 elements and 7 threads, then thread_elements = 500 / 7 = 71
// and 71 * 7 = 497, so there are 3 elements left unprocessed
// process them here:
int actual = thread_elements * num_threads;
for (int i = actual; i < a.Count; ++i)
Console.WriteLine(a[i]);
// wait all threads to finish
for (int i = 0; i < num_threads; ++i)
{
threads[i].Join();
}
void Work(object arg)
{
Console.WriteLine("Thread #" + arg + " has begun...");
// calculate my working range [start, end)
int id = (int)arg;
int mystart = id * thread_elements;
int myend = (id + 1) * thread_elements;
// start work on my range !!
for (int i = mystart; i < myend; ++i)
Console.WriteLine("Thread #" + arg + " Element " + a[i]);
}
ADDED: For your case (uploading to a server), it is the same as the code above. You assign a number of threads, giving each thread a number of elements (which is auto-calculated in the variable thread_elements, so you only need to change num_threads). For the Work method, all you need is to replace the line Console.WriteLine("Thread #" + arg + " Element " + a[i]); with your uploading code.
One more thing to keep in mind: multi-threading depends on your machine's CPU. If your CPU has 4 cores, for example, then the best performance for CPU-bound work is obtained with at most 4 threads, assigning each core a thread. If you have 10 threads, they would be slower than 4 because they compete for CPU cores (unless the threads are idle, waiting for some event to occur, e.g. uploading; in that case 10 threads can run, because they don't take 100% of CPU usage).
WARNING: DO NOT modify the list while any thread is working (add, remove, set element...), and do not assign two threads the same element. Such things will cause you a lot of bugs and exceptions!
This is a simple scenario that can be expanded in multiple ways if you add some details to your requirements:
IEnumerable<Data> LoadDataFromDatabase()
{
return ...
}
void ProcessInParallel()
{
while(true)
{
var data = LoadDataFromDatabase().ToList();
if(!data.Any()) break;
data.AsParallel().ForAll(ProcessSingleData);
}
}
void ProcessSingleData(Data d)
{
// do something with data
}
There are many ways to approach this. You can create threads and partition the list yourself, or you can take advantage of the TPL and utilize Parallel.ForEach. In the example at the link you can see that an Action is called for each member of the list being iterated over. If this is your first taste of threading, I would also attempt to do it the old-fashioned way.
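For example, a minimal Parallel.ForEach sketch for this question, assuming the list holds strings and that DoProcess is changed to accept the item it should publish; LoadBatchFromDatabase is a hypothetical helper standing in for the reload logic, and MaxDegreeOfParallelism is the knob for the asker's custom thread count:

using System.Collections.Generic;
using System.Threading.Tasks;

void PublishAll()
{
    while (true)
    {
        // hypothetical helper: returns the next chunk of records, empty when done
        List<string> batch = LoadBatchFromDatabase();
        if (batch.Count == 0) break;

        Parallel.ForEach(
            batch,
            new ParallelOptions { MaxDegreeOfParallelism = 10 }, // at most 10 concurrent workers
            item => DoProcess(item)); // assumes DoProcess takes the item to publish
    }
}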
Here's my opinion ;)
You can avoid using multithreading if your List is not really huge.
Instead of a List, you can use a Queue (FIFO - First In, First Out). Then just use the Dequeue() method to get one element of the Queue, do some work on it, and get the next one. Something like:
while(queue.Count > 0)
{
var temp = DoSomeWork(queue.Dequeue());
}
I think this will be better for your purpose.
I will get the first item of the List and do some processing,then remove the item.
Bad.
First, you want a queue, not a list.
Second, you do not process then remove, you remove THEN process.
Why?
So that you keep the locks small. Lock list access (note that you need to synchronize access), remove, THEN unlock immediately and then process. This way you keep the locks short. If you take, process, then remove, you are basically single-threaded: you have to keep the lock in place while processing so that the next thread does not take the same item again.
And as you need to synchronize access and want multiple threads, this is about the only way.
Read up on the lock statement for a start (you can later move to something like a spinlock). Do NOT use raw threads unless you have to; instead, schedule Tasks (using the Task API, new in .NET 4.0), which gives you more flexibility.
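A sketch of that shape, assuming the work items are strings in a shared Queue<string> and DoProcess does the actual publishing (both names stand in for the asker's own code):

using System.Collections.Generic;

readonly object _sync = new object();
Queue<string> _queue; // filled beforehand, e.g. from the database

void Worker()
{
    while (true)
    {
        string item;
        lock (_sync)                 // hold the lock only long enough to take one item
        {
            if (_queue.Count == 0) return;
            item = _queue.Dequeue(); // remove first...
        }
        DoProcess(item);             // ...THEN process, outside the lock
    }
}

Start as many Worker threads or Tasks as you want concurrency; because removal happens under the lock, no two workers can take the same item.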
I have a problem when multiple threads try to increase int. Here's my code:
private int _StoreIndex;
private readonly List<Store> _Stores = new List<Store>();
public void TestThreads()
{
_StoreIndex = 0;
for (int i = 0; i < 20; i++)
{
Thread thread = new Thread(() =>
{
while (_StoreIndex < _Stores.Count - 1)
{
_Stores[Interlocked.Increment(ref _StoreIndex)].CollectData();
}
});
thread.Start();
}
}
I would expect that the int gets increased by one each time a thread executes this code. However, it does not. I have also tried using lock (new object()), but this doesn't work either. The problem is that not all the stores collect data: when debugging, _StoreIndex goes like 0, 1, 1, 3, 4, 5, for example. The second object in the list is obviously skipped.
What am I doing wrong? Thanks in advance.
In your case I would use the TPL to avoid all of these problems with manual thread creation and indexes in the first place:
Parallel.ForEach(_Stores, (store) => store.CollectData());
I think it should be corrected to:
Thread thread = new Thread(() =>
{
    int index;
    // Increment returns the incremented value, so subtract 1 to get this
    // thread's unique slot and cover index 0 as well
    while ((index = Interlocked.Increment(ref _StoreIndex) - 1) < _Stores.Count)
    {
        _Stores[index].CollectData();
    }
});
Now index is local, so there is no interference, while _StoreIndex is only used atomically in a single place.
This is not an atomic operation:
_Stores[Interlocked.Increment(ref _StoreIndex)].CollectData();
Increment is atomic, but this line contains more code than a simple increment. You may need to sort out your indices first, then use a thread-safe collection to hold your stores, like ConcurrentBag, and perhaps consider the TPL and classes like Task and Parallel to perform the workload.
I have a big List that may have 50,000 or more items, and I have to do an operation against each item that takes some X time. If I use the conventional method and process them sequentially, it will take X * 50,000 on average.
I planned to optimize and save some time, and decided to use a BackgroundWorker, as there is no dependency among the items. The plan was to divide the List into 4 parts and process each part in a separate BackgroundWorker.
I want to ask:
1. Is this method dumb?
2. Is there any other, better method?
3. Can you suggest a nice and clean method to divide the List into 4 equal parts?
Thanks
If you can use .NET 4.0, then use the Task Parallel Library and have a look at
Parallel.ForEach()
Parallel ForEach How-to.
Everything is basically the same as a traditional for loop, but you work with parallelism implicitly.
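For example, a minimal sketch for the 50,000-item case; MyItem and DoOperation are placeholders for your own item type and per-item work:

using System.Collections.Generic;
using System.Threading.Tasks;

void ProcessAll(List<MyItem> items)
{
    // the TPL partitions the list across worker threads for you;
    // no manual splitting into 4 parts is needed
    Parallel.ForEach(items, item => DoOperation(item));
}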
You can also actually split it into groups.
I didn't see a built-in sequence method for it, so here's the low-level way. Please point out any blunders; I am learning.
static List<T[]> groups<T>(IList<T> original, uint n)
{
    Debug.Assert(n > 0);
    var listlist = new List<T[]>();
    var list = new List<T>();
    for (int i = 0; i < original.Count; i++)
    {
        list.Add(original[i]);
        // flush a chunk after every n items, and whatever remains at the end
        if ((i + 1) % n == 0 || i == original.Count - 1)
        {
            listlist.Add(list.ToArray());
            list.Clear();
        }
    }
    return listlist;
}
Another version, based on LINQ.
public static List<T[]> groups<T>(IList<T> original, uint n)
{
var almost_grouped = original.Select((row, i) => new { Item = row, GroupIndex = i / n });
var groups = almost_grouped.GroupBy(a => a.GroupIndex, a => a.Item);
var grouped = groups.Select(a => a.ToArray()).ToList();
return grouped;
}
This is a good method for optimizing similar, independent operations on a large collection. However, you should look at the Parallel.For method in .NET 4.0. It does all the heavy lifting for you:
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.for.aspx
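For completeness, the index-based Parallel.For equivalent over the same list (again, MyItem and DoOperation are hypothetical placeholders):

using System.Collections.Generic;
using System.Threading.Tasks;

void ProcessAll(List<MyItem> items)
{
    // Parallel.For hands out the index range [0, items.Count) to worker threads
    Parallel.For(0, items.Count, i => DoOperation(items[i]));
}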