C# MultiThreading Loop entire DataTable while limiting threads to 4 - c#

This may be a tricky question to ask, but I have a DataTable that contains 1000 rows. I want to process each of these rows on a separate thread, but with at most 4 threads running at once. So basically I want to keep 4 threads constantly busy until the whole DataTable has been processed.
Currently I have this:
foreach (DataRow dtRow in urlTable.Rows)
{
for (int i = 0; i < 4; i++)
{
Thread thread = new Thread(() => MasterCrawlerClass.MasterCrawlBegin(dtRow));
thread.Start();
}
}
I know this is backwards, but I'm not sure how to achieve what I'm looking for. I thought of a very complicated while loop, but maybe that's not the best way? Any help is always appreciated.

The simplest solution, if you have 4 CPU cores, is Parallel LINQ with a degree of parallelism of 4, which gives you at most one worker per CPU core. Otherwise you have to distribute the records between threads/tasks manually. See both solutions below.
PLINQ solution:
// DataRowCollection is a non-generic IEnumerable, so cast it before calling AsParallel()
urlTable.Rows.Cast<DataRow>().AsParallel().WithDegreeOfParallelism(4)
.Select(....)
Manual distribution:
You can distribute items across worker threads manually using a simple trick:
the N-th thread picks up every 4th item starting at index N, for instance:
First thread: 0 + 4k == 0, 4, 8...
Second: 1 + 4k == 1, 5, 9...
Third: 2 + 4k == 2, 6, 10...
Task Parallel Library solution:
private void ProcessItems(IEnumerable<string> items)
{
// TODO: ..
}
var items = new List<string>(Enumerable.Range(0, 1000)
.Select(i => i + "_ITEM"));
// Split the list into 4 interleaved partitions by index modulo 4
var items1 = items.Where((item, index) => index % 4 == 0);
var items2 = items.Where((item, index) => index % 4 == 1);
var items3 = items.Where((item, index) => index % 4 == 2);
var items4 = items.Where((item, index) => index % 4 == 3);
// Task.Factory is the default TaskFactory
var factory = Task.Factory;
var tasks = new Task[]
{
factory.StartNew(() => ProcessItems(items1)),
factory.StartNew(() => ProcessItems(items2)),
factory.StartNew(() => ProcessItems(items3)),
factory.StartNew(() => ProcessItems(items4))
};
// Block until all four partitions have been processed
Task.WaitAll(tasks);
MSDN references: WithDegreeOfParallelism(), Introduction to PLINQ
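If you would rather not partition the rows yourself, the same limit can probably be expressed with Parallel.ForEach and ParallelOptions.MaxDegreeOfParallelism. The following is only a minimal sketch, not code from the answer: BoundedLoop and Run are hypothetical names, and the peak counter is there only to make the limit observable.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original answer.
public static class BoundedLoop
{
    // Runs `work` once per item with at most `maxParallel` workers at a time.
    // Returns the highest number of workers observed running simultaneously.
    public static int Run<T>(IEnumerable<T> items, Action<T> work, int maxParallel)
    {
        int running = 0, peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(items, options, item =>
        {
            int now = Interlocked.Increment(ref running);

            // Raise `peak` to `now` if it is larger, retrying when another thread wins the race.
            int seen;
            do { seen = Volatile.Read(ref peak); }
            while (now > seen && Interlocked.CompareExchange(ref peak, now, seen) != seen);

            try { work(item); }
            finally { Interlocked.Decrement(ref running); }
        });

        return peak;
    }
}
```

With the question's table this could be called as BoundedLoop.Run(urlTable.Rows.Cast<DataRow>(), row => MasterCrawlerClass.MasterCrawlBegin(row), 4), assuming MasterCrawlBegin is safe to call from several threads.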

Related

How to increase performance of a for loop using C#

I compare task data from Microsoft project using a nested for loop. But since the project has many records (more than 1000), it is very slow.
How do I improve the performance?
for (int n = 1; n < thisProject.Tasks.Count; n++)
{
string abc = thisProject.Tasks[n].Name;
string def = thisProject.Tasks[n].ResourceNames;
for (int l = thisProject.Tasks.Count; l > n; l--)
{
// MessageBox.Show(thisProject.Tasks[l].Name);
if (abc == thisProject.Tasks[l].Name && def == thisProject.Tasks[l].ResourceNames)
{
thisProject.Tasks[l].Delete();
}
}
}
As you can see, I am comparing the Name and ResourceNames of each Task, and when I find a duplicate I call Task.Delete to get rid of it.
A hash check should be a lot faster in this case than nested looping, i.e. O(n) vs O(n^2).
First, provide an equality comparer of your own:
class TaskComparer : IEqualityComparer<Task> {
public bool Equals(Task x, Task y) {
if (ReferenceEquals(x, y)) return true;
if (ReferenceEquals(x, null)) return false;
if (ReferenceEquals(y, null)) return false;
if (x.GetType() != y.GetType()) return false;
return string.Equals(x.Name, y.Name) && string.Equals(x.ResourceNames, y.ResourceNames);
}
public int GetHashCode(Task task) {
unchecked {
return
((task?.Name?.GetHashCode() ?? 0) * 397) ^
(task?.ResourceNames?.GetHashCode() ?? 0);
}
}
}
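Don't worry too much about the GetHashCode implementation; it is just boilerplate code that combines the hash codes of the two properties. The result does not have to be unique, because Equals resolves any collisions.
Now that you have this class for comparison and hashing, you can use the code below to remove your dupes:
var set = new HashSet<Task>(new TaskComparer());
// The Tasks collection is 1-based (as in the question's loop), so iterate from Count down to 1.
// Iterating backwards keeps the remaining indexes valid after a Delete();
// note that it keeps the last occurrence of each duplicate.
for (int i = thisProject.Tasks.Count; i >= 1; --i) {
if (!set.Add(thisProject.Tasks[i]))
thisProject.Tasks[i].Delete();
}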
As you notice, you are simply scanning all your elements, while storing them into a HashSet. This HashSet will check, based on our equality comparer, if the provided element is a duplicate or not.
Now, since you want to delete it, the detected dupes are deleted. You can modify this code to simply extract the Unique items instead of deleting the dupes, by reversing the condition to if (set.Add(thisProject.Tasks[i])) and processing within this if
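To illustrate the "extract the unique items" variant without a custom comparer, here is a minimal sketch that uses a value tuple as the hash key. It assumes C# 7 or later, and the Row class and Dedupe.KeepFirst are hypothetical stand-ins for the Project task type, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a task with the two compared properties.
public sealed class Row
{
    public string Name { get; set; }
    public string ResourceNames { get; set; }
}

public static class Dedupe
{
    // Returns the rows in their original order, keeping only the first
    // row seen for each (Name, ResourceNames) pair. Runs in O(n).
    public static List<Row> KeepFirst(IEnumerable<Row> rows)
    {
        var seen = new HashSet<(string, string)>();
        var unique = new List<Row>();
        foreach (var row in rows)
        {
            // Add returns false when an equal key is already in the set.
            if (seen.Add((row.Name, row.ResourceNames)))
                unique.Add(row);
        }
        return unique;
    }
}
```

Value tuples already implement structural equality and hashing, which is why no IEqualityComparer is needed here.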
Microsoft Project has a Sort method which makes simple work of this problem. Sort the tasks by Name, Resource Names, and Unique ID and then loop through comparing adjacent tasks and delete duplicates. By using Unique ID as the third sort key you can be sure to delete the duplicate that was added later. Alternatively, you can use the task ID to remove tasks that are lower down in the schedule. Here's a VBA example of how to do this:
Sub RemoveDuplicateTasks()
Dim proj As Project
Set proj = ActiveProject
Application.Sort Key1:="Name", Ascending1:=True, Key2:="Resource Names", Ascending2:=True, Key3:="Unique ID", Ascending3:=True, Renumber:=False, Outline:=False
Application.SelectAll
Dim tsks As Tasks
Set tsks = Application.ActiveSelection.Tasks
Dim i As Integer
i = 1 ' the selection's Tasks collection is 1-based
Do While i < tsks.Count
If tsks(i).Name = tsks(i + 1).Name And tsks(i).ResourceNames = tsks(i + 1).ResourceNames Then
tsks(i + 1).Delete
Else
i = i + 1
End If
Loop
Application.Sort Key1:="ID", Renumber:=False, Outline:=False
Application.SelectBeginning
End Sub
Note: This question relates to the algorithm, not the syntax; VBA is easy to translate to C#.
This should give you all the items which are duplicates, so you can delete them from your original list.
// Cast<Task>() is only needed if Tasks is a non-generic collection
thisProject.Tasks.Cast<Task>().GroupBy(x => new { x.Name, x.ResourceNames }).Where(g => g.Count() > 1).SelectMany(g => g);
Note that you probably do not want to remove all of them, only the duplicate versions, so be careful how you loop through this list.
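One way to get only the later copies, so that the first item of each group survives, is to skip the first element of every group. This is a minimal sketch on a plain in-memory list; Entry and Duplicates.LaterCopies are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a task; Id only helps to tell copies apart.
public sealed class Entry
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string ResourceNames { get; set; }
}

public static class Duplicates
{
    // Returns every item that repeats an earlier (Name, ResourceNames) pair.
    // LINQ to Objects keeps elements inside a group in source order,
    // so Skip(1) drops exactly the first occurrence.
    public static List<Entry> LaterCopies(IEnumerable<Entry> entries) =>
        entries.GroupBy(e => new { e.Name, e.ResourceNames })
               .SelectMany(g => g.Skip(1))
               .ToList();
}
```

The returned list can then be deleted item by item without touching the originals.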
A Linq way of getting distinct elements from your Tasks list :
public class Task
{
public string Name {get;set;}
public string ResourceName {get;set;}
}
public class Program
{
public static void Main()
{
List<Task> Tasks = new List<Task>();
Tasks.Add(new Task(){Name = "a",ResourceName = "ra"});
Tasks.Add(new Task(){Name = "b",ResourceName = "rb"});
Tasks.Add(new Task(){Name = "c",ResourceName = "rc"});
Tasks.Add(new Task(){Name = "a",ResourceName = "ra"});
Tasks.Add(new Task(){Name = "b",ResourceName = "rb"});
Tasks.Add(new Task(){Name = "c",ResourceName = "rc"});
Console.WriteLine("Initial List :");
foreach(var t in Tasks){
Console.WriteLine(t.Name);
}
// Here comes the interesting part
List<Task> Tasks2 = Tasks.GroupBy(x => new {x.Name, x.ResourceName})
.Select(g => g.First()).ToList();
Console.WriteLine("Final List :");
foreach(Task t in Tasks2){
Console.WriteLine(t.Name);
}
}
}
This selects the first element of each group of tasks that share the same Name and ResourceName.

Nested Threads (Tasks) Timing Out Prematurely

I have the following code. What it does isn't important, I believe, but I'm getting strange behavior.
When I run just the months on separate threads (as in the code below), it runs fine, but when I also multi-thread the years (uncomment the tasks), it times out every time. The timeout is set to 5 minutes for months and 20 minutes for years, yet it times out within a minute.
Is there a known reason for this behavior? Am I missing something simple?
public List<PotentialBillingYearItem> GeneratePotentialBillingByYear()
{
var years = new List<PotentialBillingYearItem>();
//var tasks = new List<Task>();
var startYear = new DateTime(DateTime.Today.Year - 10, 1, 1);
var range = new DateRange(startYear, DateTime.Today.LastDayOfMonth());
for (var i = range.Start; i <= range.End; i = i.AddYears(1))
{
var yearDate = i;
//tasks.Add(Task.Run(() =>
//{
years.Add(new PotentialBillingYearItem
{
Total = GeneratePotentialBillingMonths(new PotentialBillingParameters { Year = yearDate.Year }).Average(s => s.Total),
Date = yearDate
});
//}));
}
//Task.WaitAll(tasks.ToArray(), TimeSpan.FromMinutes(20));
return years;
}
public List<PotentialBillingItem> GeneratePotentialBillingMonths(PotentialBillingParameters Parameters)
{
var items = new List<PotentialBillingItem>();
var tasks = new List<Task>();
var year = new DateTime(Parameters.Year, 1, 1);
var range = new DateRange(year, year.LastDayOfYear());
range.Start = range.Start == range.End ? DateTime.Now.FirstDayOfYear() : range.Start.FirstDayOfMonth();
if (range.End > DateTime.Today) range.End = DateTime.Today.LastDayOfMonth();
for (var i = range.Start; i <= range.End; i = i.AddMonths(1))
{
var firstDayOfMonth = i;
var lastDayOfMonth = i.LastDayOfMonth();
var monthRange = new DateRange(firstDayOfMonth, lastDayOfMonth);
tasks.Add(Task.Run(() =>
{
using (var db = new AlbionConnection())
{
var invoices = GetInvoices(lastDayOfMonth);
var timeslipSets = GetTimeslipSets();
var item = new PotentialBillingItem
{
Date = firstDayOfMonth,
PostedInvoices = CalculateInvoiceTotals(invoices.Where(w => w.post_date <= lastDayOfMonth), monthRange),
UnpostedInvoices = CalculateInvoiceTotals(invoices.Where(w => w.post_date == null || w.post_date > lastDayOfMonth), monthRange),
OutstandingDrafts = CalculateOutstandingDraftTotals(timeslipSets)
};
items.Add(item);
}
}));
}
Task.WaitAll(tasks.ToArray(), TimeSpan.FromMinutes(5));
return items;
}
You might consider pre-allocating a larger number of thread pool threads. The thread pool is very slow to create new threads. The code below takes only 10 seconds (the theoretical minimum) to run with the minimum number of thread pool threads set to 2,500, but commenting out the SetMinThreads call makes it take over a minute and a half.
static void Main(string[] args)
{
ThreadPool.SetMinThreads(2500, 10);
Stopwatch sw = Stopwatch.StartNew();
RunTasksOutter(10);
sw.Stop();
Console.WriteLine($"Finished in {sw.Elapsed}");
}
public static void RunTasksOutter(int num) => Task.WaitAll(Enumerable.Range(0, num).Select(x => Task.Run(() => RunTasksInner(10))).ToArray());
public static void RunTasksInner(int num) => Task.WaitAll(Enumerable.Range(0, num).Select(x => Task.Run(() => Thread.Sleep(10000))).ToArray());
You could also be running out of thread pool threads. Per https://msdn.microsoft.com/en-us/library/0ka9477y(v=vs.110).aspx, one of the cases where you should not use the thread pool (which is what tasks run on) is:
You have tasks that cause the thread to block for long periods of time. The thread pool has a maximum number of threads, so a large number of blocked thread pool threads might prevent tasks from starting.
Since IO is being done on these threads, maybe consider replacing them with async code or starting them with the LongRunning option? https://msdn.microsoft.com/en-us/library/system.threading.tasks.taskcreationoptions(v=vs.110).aspx
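The LongRunning suggestion might look like the sketch below. BlockingWork.RunAll is a hypothetical helper, not code from the question. It also returns the result of Task.WaitAll, because the timeout overload reports a timeout by returning false rather than by throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for starting blocking work outside the thread pool.
public static class BlockingWork
{
    // Starts every action with the LongRunning hint and waits for all of them.
    // Returns true if all actions finished within `timeout`, false otherwise.
    public static bool RunAll(IEnumerable<Action> actions, TimeSpan timeout)
    {
        Task[] tasks = actions
            .Select(a => Task.Factory.StartNew(
                a,
                CancellationToken.None,
                TaskCreationOptions.LongRunning, // hint: use a dedicated thread, not a pool worker
                TaskScheduler.Default))
            .ToArray();

        // With a timeout, WaitAll returns false when time runs out instead of throwing.
        return Task.WaitAll(tasks, timeout);
    }
}
```

Checking the returned bool makes a timeout visible to the caller; in the question's code the return value of Task.WaitAll is ignored, so a timeout would go unnoticed there.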

Optimize performance of a Parallel.For

I have replaced a for loop in my code with a Parallel.For. The performance improvement is awesome (1/3 of the running time). I've tried to account for shared resources by using an array to gather result codes, which I then process outside the Parallel.For. Is this the most efficient way, or will blocking still occur even though no two iterations can ever share the same loop index? Would a CompareExchange perform much better?
int[] pageResults = new int[arrCounter];
Parallel.For(0, arrCounter, i =>
{
AlertToQueueInput input = new AlertToQueueInput();
input.Message = Messages[i];
pageResults[i] = scCommunication.AlertToQueue(input).ReturnCode;
});
foreach (int r in pageResults)
{
if (r != 0 && outputPC.ReturnCode == 0) outputPC.ReturnCode = r;
}
It depends on whether you have any (valuable) side-effects in the main loop.
When the outputPC.ReturnCode is the only result, you can use PLINQ:
outputPC.ReturnCode = Messages
.AsParallel()
.Select(msg =>
{
AlertToQueueInput input = new AlertToQueueInput();
input.Message = msg;
return scCommunication.AlertToQueue(input).ReturnCode;
})
.FirstOrDefault(r => r != 0);
This assumes scCommunication.AlertToQueue() is thread-safe and you don't want to call it for the remaining items after the first error.
Note that FirstOrDefault() in PLINQ is only efficient in .NET Framework 4.5 and later.
You could replace:
foreach (int r in pageResults)
{
if (r != 0 && outputPC.ReturnCode == 0) outputPC.ReturnCode = r;
}
with:
foreach (int r in pageResults)
{
if (r != 0)
{
outputPC.ReturnCode = r;
break;
}
}
This will then stop the loop on the first fail.
I like David Arno's solution, but as I see it you can improve the speed by putting the check inside the parallel loop and breaking out of it directly. Your main code treats the whole run as failed if any iteration fails, so there is no need for further iterations.
Something like this:
Parallel.For(0, arrCounter, (i, loopState) =>
{
AlertToQueueInput input = new AlertToQueueInput();
input.Message = Messages[i];
var code = scCommunication.AlertToQueue(input).ReturnCode;
if (code != 0)
{
outputPC.ReturnCode = code;
// Break() still lets iterations with a lower index run; Stop() would end the loop sooner
loopState.Break();
}
});
Upd 1:
If you need to save the result of all iterations you can do something like this:
int[] pageResults = new int[arrCounter];
Parallel.For(0, arrCounter, (i, loopState) =>
{
AlertToQueueInput input = new AlertToQueueInput();
input.Message = Messages[i];
var code = scCommunication.AlertToQueue(input).ReturnCode;
pageResults[i] = code;
// Note: this check-then-set on shared state is not atomic across threads
if (code != 0 && outputPC.ReturnCode == 0)
outputPC.ReturnCode = code;
});
It will save you from the foreach loop which is an improvement although a small one.
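Regarding the CompareExchange idea from the question, a minimal sketch could look like this. FirstError.Run and the getCode delegate are hypothetical; getCode stands in for the call to scCommunication.AlertToQueue(input).ReturnCode and is assumed to be thread-safe.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper showing an atomic "set once" result code.
public static class FirstError
{
    // Returns 0 if every iteration returned 0; otherwise returns the first
    // non-zero code that was recorded (not necessarily the lowest index).
    public static int Run(int count, Func<int, int> getCode)
    {
        int firstError = 0;
        Parallel.For(0, count, (i, state) =>
        {
            int code = getCode(i);
            if (code != 0)
            {
                // Atomically store `code` only if no error has been stored yet.
                Interlocked.CompareExchange(ref firstError, code, 0);
                // Ask the loop not to start any further iterations.
                state.Stop();
            }
        });
        return firstError;
    }
}
```

Because each iteration writes only on failure, the successful path touches no shared state at all, so contention should be negligible compared with the cost of the remote call.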
UPD 2:
I just found this post, and I think a custom parallel loop is a good solution too. But it's your call whether it fits your task.

Does Linq provide a way to easily spot gaps in a sequence?

I am managing a directory of files. Each file will be named similarly to Image_000000.png, with the numeric portion being incremented for each file that is stored.
Files can also be deleted, leaving gaps in the number sequence. The reason I am asking is that I recognize that at some point in the future the user could use up the number sequence unless I take steps to reuse numbers when they become available. I realize that it is a million, and that's a lot, but we have 20-plus year users, so "someday" is not out of the question.
So, I am specifically asking whether or not there exists a way to easily determine the gaps in the sequence without simply looping. I realize that because it's a fixed range, I could simply loop over the expected range.
And I will unless there is a better/cleaner/easier/faster alternative. If so, I'd like to know about it.
This method is called to obtain the next available file name:
public static String GetNextImageFileName()
{
String retFile = null;
DirectoryInfo di = new DirectoryInfo(userVars.ImageDirectory);
FileInfo[] fia = di.GetFiles("*.*", SearchOption.TopDirectoryOnly);
String lastFile = fia.Where(i => i.Name.StartsWith("Image_") && i.Name.Substring(6, 6).ContainsOnlyDigits()).OrderBy(i => i.Name).Last().Name;
if (!String.IsNullOrEmpty(lastFile))
{
Int32 num;
String strNum = lastFile.Substring(6, 6);
String strExt = lastFile.Substring(13);
if (!String.IsNullOrEmpty(strNum) &&
!String.IsNullOrEmpty(strExt) &&
strNum.ContainsOnlyDigits() &&
Int32.TryParse(strNum, out num))
{
num++;
retFile = String.Format("Image_{0:D6}.{1}", num, strExt);
while (num <= 999999 && File.Exists(retFile))
{
num++;
retFile = String.Format("Image_{0:D6}.{1}", num, strExt);
}
}
}
return retFile;
}
EDIT: in case it helps anyone, here is the final method, incorporating Daniel Hilgarth's answer:
public static String GetNextImageFileName()
{
DirectoryInfo di = new DirectoryInfo(userVars.ImageDirectory);
FileInfo[] fia = di.GetFiles("Image_*.*", SearchOption.TopDirectoryOnly);
List<Int32> fileNums = new List<Int32>();
foreach (FileInfo fi in fia)
{
Int32 i;
// Skip names that are too short to contain the six-digit number
if (fi.Name.Length >= 12 && Int32.TryParse(fi.Name.Substring(6, 6), out i))
fileNums.Add(i);
}
// GetFiles does not guarantee any ordering, so sort before looking for a gap
fileNums.Sort();
// Index of the first position whose value differs from its index, or null if there is no gap
var result = fileNums.Select((x, i) => new { Index = i, Value = x })
.Where(x => x.Index != x.Value)
.Select(x => (Int32?)x.Index)
.FirstOrDefault();
// In a sorted list with one file per number, the first mismatching index is the first free number;
// without a gap, the next number equals the count
Int32 nextNumber = result ?? fileNums.Count;
if (nextNumber <= 999999)
return String.Format("Image_{0:D6}", nextNumber);
return null;
}
A very simple approach to find the first number of the first gap would be the following:
int[] existingNumbers = /* extract all numbers from all filenames and order them */
var allNumbers = Enumerable.Range(0, 1000000);
var result = allNumbers.Where(x => !existingNumbers.Contains(x)).First();
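Note that Enumerable.Range(0, 1000000) yields 0 through 999,999, so if all numbers have been used and no gaps exist, First() throws an InvalidOperationException; pass 1000001 as the count if you want 1,000,000 returned in that case.
This approach has the drawback that it performs rather badly, because Contains scans existingNumbers once for every candidate number.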
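A somewhat better approach would be to use Zip (existingNumbers must be sorted in ascending order; if it contains no gap, no pair differs and First() throws):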
allNumbers.Zip(existingNumbers, (a, e) => new { Number = a, ExistingNumber = e })
.Where(x => x.Number != x.ExistingNumber)
.Select(x => x.Number)
.First();
An improved version of DuckMaestro's answer that actually returns the first value of the first gap - and not the first value after the first gap - would look like this:
var tmp = existingNumbers.Select((x, i) => new { Index = i, Value = x })
.Where(x => x.Index != x.Value)
.Select(x => (int?)x.Index)
.FirstOrDefault();
int index;
if(tmp == null)
index = existingNumbers.Length - 1;
else
index = tmp.Value - 1;
var nextNumber = existingNumbers[index] + 1;
Improving over the other answer, use the alternate version of Where.
int[] existingNumbers = ...
var result = existingNumbers.Where( (x,i) => x != i ).FirstOrDefault();
The value i is a counter starting at 0.
This version of where is supported in .NET 3.5 (http://msdn.microsoft.com/en-us/library/bb549418(v=vs.90).aspx).
var firstnonexistingfile = Enumerable.Range(0, 1000000).Select(x => String.Format("Image_{0:D6}.{1}", x, strExt)).FirstOrDefault(x => !File.Exists(x));
This will iterate from 0 to 999999, then output the result of the String.Format() as an IEnumerable<string> and then find the first string out of that sequence that returns false for File.Exists().
It's an old question, but it has been suggested (in the comments) that you could use .Except() instead. I tend to like this solution a little better since it will give you the first missing number (the gap) or the next smallest number in the sequence. Here's an example:
var allNumbers = Enumerable.Range(0, 999999); //999999 is arbitrary. You could use int.MaxValue, but it would degrade performance
var existingNumbers = new int[] { 0, 1, 2, 4, 5, 6 };
int result;
var missingNumbers = allNumbers.Except(existingNumbers);
if (missingNumbers.Any())
result = missingNumbers.First();
else //no missing numbers -- you've reached the max
result = -1;
Running the above code would set result to:
3
Additionally, if you changed existingNumbers to:
var existingNumbers = new int[] { 0, 1, 3, 2, 4, 5, 6 };
so that there isn't a gap, you would get 7 back.
Anyway, that's why I prefer Except over the Zip solution -- just my two cents.
Thanks!
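For reference, the gap search from the answers above can also be sketched with a HashSet, which avoids both the repeated scans of Contains on an array and the need for sorted input. Gaps.FirstFree and Gaps.FormatName are hypothetical names; the "Image_" pattern is taken from the question.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; works on unsorted input and tolerates duplicates.
public static class Gaps
{
    // Returns the smallest non-negative integer that is not in `existing`.
    public static int FirstFree(IEnumerable<int> existing)
    {
        var used = new HashSet<int>(existing);
        // Among the used.Count + 1 candidates 0..used.Count at least one is unused,
        // so First() always finds a result here.
        return Enumerable.Range(0, used.Count + 1).First(n => !used.Contains(n));
    }

    // Formats a number the way the question names its files (without extension).
    public static string FormatName(int number) =>
        string.Format("Image_{0:D6}", number);
}
```

The caller still has to check the result against the upper limit of 999999 before using it.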

Null Reference while handling a List in multiple threads

Basically, I have a collection of objects. I am chopping it into smaller collections and doing some work on each small collection simultaneously, one thread per collection.
int totalCount = SomeDictionary.Values.ToList().Count;
int singleThreadCount = (int)Math.Round((decimal)(totalCount / 10));
int lastThreadCount = totalCount - (singleThreadCount * 9);
Stopwatch sw = new Stopwatch();
Dictionary<int,Thread> allThreads = new Dictionary<int,Thread>();
List<rCode> results = new List<rCode>();
for (int i = 0; i < 10; i++)
{
int count = i;
if (i != 9)
{
Thread someThread = new Thread(() =>
{
List<rBase> objects = SomeDictionary.Values
.Skip(count * singleThreadCount)
.Take(singleThreadCount).ToList();
List<rCode> result = objects.Where(r => r.ZBox != null)
.SelectMany(r => r.EffectiveCBox, (r, CBox) => new rCode
{
RBox = r,
// A Zbox may refer an object that can be
// shared by many
// rCode objects even on different threads
ZBox = r.ZBox,
CBox = CBox
}).ToList();
results.AddRange(result);
});
allThreads.Add(i, someThread);
someThread.Start();
}
else
{
Thread someThread = new Thread(() =>
{
List<rBase> objects = SomeDictionary.Values
.Skip(count * singleThreadCount)
.Take(lastThreadCount).ToList();
List<rCode> result = objects.Where(r => r.ZBox != null)
.SelectMany(r => r.EffectiveCBox, (r, CBox) => new rCode
{
RBox = r,
// A Zbox may refer an object that
// can be shared by many
// rCode objects even on different threads
ZBox = r.ZBox,
CBox = CBox
}).ToList();
results.AddRange(result);
});
allThreads.Add(i, someThread);
someThread.Start();
}
}
sw.Start();
while (allThreads.Values.Any(th => th.IsAlive))
{
if (sw.ElapsedMilliseconds >= 60000)
{
results = null;
allThreads.Values.ToList().ForEach(t => t.Abort());
sw.Stop();
break;
}
}
return results != null ? results.OrderBy(r => r.ZBox.Name).ToList():null;
My issue is that SOMETIMES I get a null reference exception while performing the OrderBy operation before returning the results, and I couldn't determine exactly where the exception occurs. I press back, click the same button that runs this operation on the same data again, and it works! If someone can help me identify this issue I would be more than grateful. NOTE: a ZBox may refer to an object that can be shared by many rCode objects, even on different threads. Can this be the issue?
I can't determine this through testing, because the error is not deterministic.
The chosen answer correctly identifies the bug, although I do not agree with its fix. You should switch to a concurrent collection instead, in your case a ConcurrentBag or ConcurrentQueue. Some of these are (partially) lock-free for better performance, and they give you shorter, more readable code since you do not need manual locking.
Your code would also more than halve in size and double in readability if you avoid manually created threads and manual partitioning:
Parallel.ForEach(objects, MyObjectProcessor);
public void MyObjectProcessor(Object o)
{
// Create result and add to results
}
Use a ParallelOptions object (its MaxDegreeOfParallelism property) if you want to limit the number of threads with Parallel.ForEach.
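Put together, the concurrent-collection approach could be sketched as follows. SafeCollect.Run and its expand delegate are hypothetical; expand stands in for the SelectMany projection from the question that builds the rCode objects.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: gathers results from parallel workers without manual locking.
public static class SafeCollect
{
    // Applies `expand` to every source item in parallel (at most `maxParallel`
    // workers) and returns all produced results. The result order is not defined.
    public static List<TResult> Run<T, TResult>(
        IEnumerable<T> source,
        Func<T, IEnumerable<TResult>> expand,
        int maxParallel)
    {
        var bag = new ConcurrentBag<TResult>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(source, options, item =>
        {
            // ConcurrentBag.Add is safe to call from many threads at once,
            // unlike List<T>.AddRange in the question.
            foreach (var result in expand(item))
                bag.Add(result);
        });

        return bag.ToList();
    }
}
```

Since the order of the returned list is not defined, any sorting such as the question's OrderBy still has to happen after the loop.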
Well, one obvious problem is here:
results.AddRange(result);
where you're updating a list from multiple threads. Try using a lock:
object resultsLock = new object(); // globally visible
...
lock(resultsLock)
{
results.AddRange(result);
}
I suppose the problem is in results = null:
while (allThreads.Values.Any(th => th.IsAlive))
{ if (sw.ElapsedMilliseconds >= 60000) { results = null; allThreads.Values.ToList().ForEach(t => t.Abort());
If the threads have not finished within 60000 ms, results is set to null, and calling results.OrderBy(r => r.ZBox.Name).ToList() on it throws an exception.
You should add something like this:
if (results != null)
return results.OrderBy(r => r.ZBox.Name).ToList();
else
return null;
