I have a C# requirement for individually processing a 'great many' (perhaps > 100,000) records. Running this process sequentially is proving to be very slow with each record taking a good second or so to complete (with a timeout error set at 5 seconds).
I would like to try running these tasks asynchronously by using a set number of worker 'threads' (I use the term 'thread' cautiously here, as I am not sure whether I should be looking at a thread, a task, or something else).
I have looked at the ThreadPool, but I can't imagine it could queue the volume of requests required. My ideal pseudo code would look something like this...
public void ProcessRecords()
{
    SetMaxNumberOfThreads(20);
    MyRecord rec;
    while ((rec = GetNextRecord()) != null)
    {
        var task = WaitForNextAvailableThreadFromPool(ProcessRecord(rec));
        task.Start();
    }
}
I will also need a mechanism by which the processing method can report back to the parent/calling class.
Can anyone point me in the right direction with perhaps some example code?
A possible simple solution would be to use a TPL Dataflow block, which is a higher abstraction over the TPL with configuration for degree of parallelism and so forth. You simply create the block (an ActionBlock in this case), Post everything to it, wait asynchronously for completion, and TPL Dataflow handles all the rest for you:
var block = new ActionBlock<MyRecord>(
rec => ProcessRecord(rec),
new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = 20});
MyRecord rec;
while ((rec = GetNextRecord()) != null)
{
block.Post(rec);
}
block.Complete();
await block.Completion;
Another benefit is that the block starts working as soon as the first record arrives and not only when all the records have been received.
If you need to report back on each record you can use a TransformBlock to do the actual processing and link an ActionBlock to it that does the updates:
var transform = new TransformBlock<MyRecord, Report>(rec =>
{
ProcessRecord(rec);
return GenerateReport(rec);
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = 20});
var reporter = new ActionBlock<Report>(report =>
{
RaiseEvent(report); // Or any other mechanism...
});
transform.LinkTo(reporter, new DataflowLinkOptions { PropagateCompletion = true });
MyRecord rec;
while ((rec = GetNextRecord()) != null)
{
transform.Post(rec);
}
transform.Complete();
await reporter.Completion; // completion propagates from transform via the link
Have you thought about using parallel processing with Actions?
That is, create a method to process a single record, add each record's processing as an Action in a list, and then run a Parallel.ForEach over the list.
Dim list As New List(Of Action)
list.Add(New Action(Sub() MyMethod(myParameter)))
Parallel.ForEach(list, Sub(t) t.Invoke())
This is in VB.NET, but I think you get the gist.
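For reference, a rough C# equivalent of the same idea (MyMethod and myParameter are the placeholders carried over from the VB sketch):
var list = new List<Action>();
list.Add(() => MyMethod(myParameter));
// Invoke each queued action on the thread pool, in parallel.
Parallel.ForEach(list, t => t());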
I am using IBackgroundJobManager with Hangfire integration.
Use case:
I am processing a single uploaded file. After the file is saved, I would like to start two separate Abp.BackgroundJobs sequentially. Only after the first job completes should the second job start.
Here is my code:
var measureJob1 = await _backgroundJobManager.EnqueueAsync<FileProcessBackgroundJob, FileProcessJobArgsDto>(
new FileProcessJobArgsDto
{
Id = Id,
User = user,
})
.ContinueWith<AnalyticsBackgroundJob<AnalyticsJobArgsDto>>("measureJob",x => x);
Problem:
I cannot figure out the syntax for what I need when using .ContinueWith<???>(???).
This seems to be an XY Problem.
How to start the second background job after the first completes? (Problem X)
There is no support for guaranteeing the sequential and conditional execution of jobs.
That said, background jobs are automatically retried, which allows for Way 3 below.
So, you can effectively achieve that by one of these ways:
Combine the two jobs into one job.
Enqueue the second job inside the first job.
Enqueue both jobs (you don't need to use ContinueWith; await is much neater). In the second job, check if the first job created what it needed to. Otherwise, throw an exception and rely on retry (see the sketch below).
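A rough sketch of Way 3's check inside the second job, assuming the usual Abp BackgroundJob<TArgs> base class (the existence check and RunAnalytics are hypothetical stand-ins):
public class AnalyticsBackgroundJob : BackgroundJob<AnalyticsJobArgsDto>, ITransientDependency
{
    public override void Execute(AnalyticsJobArgsDto args)
    {
        // Hypothetical check: has the first job produced its output yet?
        if (!ProcessedFileExists(args))
        {
            // Throwing makes the background job system retry this job later.
            throw new ApplicationException("File not processed yet; will retry.");
        }

        RunAnalytics(args); // hypothetical: the actual analytics work
    }
}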
What is the syntax for what I need when using ContinueWith? (Solution Y)
There is no syntax for that, other than a less desirable variant of Way 3.
Task.ContinueWith is a C# construct that runs after the EnqueueAsync task, not the actual background job.
The syntax for Way 3 with ContinueWith would be something like:
var measureJobId = await _backgroundJobManager.EnqueueAsync<FileProcessBackgroundJob, FileProcessJobArgsDto>(
new FileProcessJobArgsDto
{
Id = Id,
User = user,
}
)
.ContinueWith(task => _backgroundJobManager.EnqueueAsync<AnalyticsBackgroundJob, AnalyticsJobArgsDto>(
new AnalyticsJobArgsDto("measureJob")
))
.Unwrap();
Compare that to:
var processJobId = await _backgroundJobManager.EnqueueAsync<FileProcessBackgroundJob, FileProcessJobArgsDto>(
new FileProcessJobArgsDto
{
Id = Id,
User = user,
}
);
var measureJobId = await _backgroundJobManager.EnqueueAsync<AnalyticsBackgroundJob, AnalyticsJobArgsDto>(
new AnalyticsJobArgsDto("measureJob")
);
I have a requirement where I have to pull data from the Sailthru API. The problem is it will take forever if I make the calls synchronously, as the response time depends on the data. I am very new to threading and tried out something, but it didn't seem to work as expected. Could anyone please guide me?
Below is my sample code:
public void GetJobId()
{
Hashtable BlastIDs = getBlastIDs();
foreach (DictionaryEntry entry in BlastIDs)
{
Hashtable blastStats = new Hashtable();
blastStats.Add("stat", "blast");
blastStats.Add("blast_id", entry.Value.ToString());
//Function call 1
//Thread newThread = new Thread(() =>
//{
GetBlastDetails(entry.Value.ToString());
//});
//newThread.Start();
}
}
public void GetBlastDetails(string blast_id)
{
Hashtable tbData = new Hashtable();
tbData.Add("job", "blast_query");
tbData.Add("blast_id", blast_id);
response = client.ApiPost("job", tbData);
object data = response.RawResponse;
JObject jtry = new JObject();
jtry = JObject.Parse(response.RawResponse.ToString());
if (jtry.SelectToken("job_id") != null)
{
//Function call 2
Thread newThread = new Thread(() =>
{
GetJobwiseDetail(jtry.SelectToken("job_id").ToString(), client, blast_id);
});
newThread.Start();
}
}
public void GetJobwiseDetail(string job_id, SailthruClient client, string blast_id)
{
Hashtable tbData = new Hashtable();
tbData.Add("job_id", job_id);
SailthruResponse response;
response = client.ApiGet("job", tbData);
JObject jtry = new JObject();
jtry = JObject.Parse(response.RawResponse.ToString());
string status = jtry.SelectToken("status").ToString();
if (status != "completed")
{
//Function call 3
Thread.Sleep(3000);
Thread newThread = new Thread(() =>
{
GetJobwiseDetail(job_id, client, blast_id);
});
newThread.Start();
string str = "test sleeping thread";
}
else {
string export_url = jtry.SelectToken("export_url").ToString();
TraceService(export_url);
SaveCSVDataToDB(export_url,blast_id);
}
}
I want Function call 1 to start asynchronously (or maybe after a gap of 3 seconds, to avoid load on the processor). In Function call 3, if the status is not completed, I call the same function again with a delay of 3 seconds to allow time for the response to arrive.
Also correct me if my question sounds stupid.
You should never use Thread.Sleep like that, because among other things you don't know whether 3000 ms will be enough. Instead of the Thread class you should use the Task class, which is a much better option because it provides additional features and thread-pool management. I don't have access to an IDE, but you should try using Task.Factory.StartNew to invoke your request asynchronously. Your function GetJobwiseDetail should then return the value you want to save to the database, and you can use .ContinueWith with a delegate that saves the function's result to the database. If you are able to use .NET 4.5, you should try the async/await feature.
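A minimal sketch of that suggestion, assuming GetJobwiseDetail is reworked to return the parsed result (SaveResultToDb is a hypothetical stand-in for the database write):
Task.Factory.StartNew(() => GetJobwiseDetail(job_id, client, blast_id))
    .ContinueWith(t => SaveResultToDb(t.Result)); // runs after the request completes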
Easier solution:
Parallel.ForEach
Get some details on the internet about it; you don't need to know anything about threading. It will invoke every loop iteration on another thread.
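Applied to the question's loop, a sketch could look like this (getBlastIDs() and GetBlastDetails are from the question; Cast comes from System.Linq, since Hashtable is non-generic):
Parallel.ForEach(
    getBlastIDs().Values.Cast<object>(),
    id => GetBlastDetails(id.ToString()));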
First of all, avoid at all costs starting new Threads in your code; those threads will suck your memory like a collapsing star, because each of them gets ~1 MB of memory allocated to it.
Now for the code - depending on the framework you can choose from the following:
ThreadPool.QueueUserWorkItem for older versions of .NET Framework
Parallel.ForEach for .NET Framework 4
async and await for .NET Framework 4.5
TPL Dataflow also for .NET Framework 4.5
The code you show fits quite well with Dataflow. I also wouldn't suggest using async/await here, because its usage would turn your example into a sort of fire-and-forget mechanism, which is against the recommendations for using async/await.
To use Dataflow you'll need in broad lines:
one TransformBlock which will take a string as an input and will return the response from the API
one BroadcastBlock that will broadcast the response to
two ActionBlocks; one to store data in the database and the other one to call TraceService
The code should look like this:
var downloader = new TransformBlock<string, SailthruResponse>(jobId =>
{
var data = new Hashtable();
data.Add("job_id", jobId);
return client.ApiGet("job", data);
});
var broadcaster = new BroadcastBlock<SailthruResponse>(response => response);
var databaseWriter = new ActionBlock<SailthruResponse>(response =>
{
// save to database...
});
var tracer = new ActionBlock<SailthruResponse>(response =>
{
//TraceService() call
});
var options = new DataflowLinkOptions{ PropagateCompletion = true };
// link blocks
downloader.LinkTo(broadcaster, options);
broadcaster.LinkTo(databaseWriter, options);
broadcaster.LinkTo(tracer, options);
// process values
foreach(var value in getBlastIds())
{
downloader.Post(value);
}
downloader.Complete();
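One small addition worth noting: since completion propagates through the links, after calling Complete() you can wait for the two leaf blocks to drain before moving on, e.g.:
Task.WaitAll(databaseWriter.Completion, tracer.Completion);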
Try async, as below.
async Task Task_MethodAsync()
{
    await Task.Delay(1000); // stand-in for real asynchronous work
    // The method has no return statement.
}
At work, one of our processes uses a SQL database table as a queue. I've been designing a queue reader to check the table for queued work, update the row status when work starts, and delete the row when the work is finished. I'm using Parallel.ForEach to give each process its own thread and setting MaxDegreeOfParallelism to 4.
When the queue reader starts up, it checks for any unfinished work and loads that work into a list, then it Concats that list with a method that returns an IEnumerable, which runs in an infinite loop checking for new work to do. The idea is that the unfinished work should be processed first, and then the new work can be picked up as threads become available. However, what I'm seeing is that FetchQueuedWork will change dozens of rows in the queue table to 'Processing' immediately, but only work on a few items at a time.
What I expected to happen was that FetchQueuedWork would only get new work and update the table when a slot opened up in the Parallel.ForEach. What's really odd to me is that it behaves exactly as I would expect when I run the code in my local development environment, but in production I get the above problem.
I'm using .NET 4. Here is the code:
public void Go()
{
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = 4 }, DoWork);
}
private IEnumerable<WorkData> FetchQueuedWork()
{
while (true)
{
var workUnit = WorkData.GetQueuedWorkAndSetStatusToProcessing();
yield return workUnit;
}
}
private void DoWork(WorkData workUnit)
{
if (!workUnit.Loaded)
{
System.Threading.Thread.Sleep(5000);
return;
}
Work();
}
I suspect that the default (Release mode?) behaviour is to buffer the input. You might need to create your own partitioner and pass it the NoBuffering option:
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
var partitioner = Partitioner.Create(work, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, options, DoWork);
Blorgbeard's solution is correct when it comes to .NET 4.5 - hands down.
If you are constrained to .NET 4, you have a few options:
Replace your Parallel.ForEach with work.AsParallel().WithDegreeOfParallelism(4).ForAll(DoWork). PLINQ is more conservative when it comes to buffering items, so this should do the trick.
Write your own enumerable partitioner (good luck).
Create a grotty semaphore-based hack such as this:
(Side-effecting Select used for the sake of brevity)
public void Go()
{
const int MAX_DEGREE_PARALLELISM = 4;
using (var semaphore = new SemaphoreSlim(MAX_DEGREE_PARALLELISM, MAX_DEGREE_PARALLELISM))
{
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork
.Concat(FetchQueuedWork())
.Select(w =>
{
// Side-effect: bad practice, but easier
// than writing your own IEnumerable.
semaphore.Wait();
return w;
});
// You still need to specify MaxDegreeOfParallelism
// here so as not to saturate your thread pool when
// Parallel.ForEach's load balancer kicks in.
Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = MAX_DEGREE_PARALLELISM }, workUnit =>
{
try
{
this.DoWork(workUnit);
}
finally
{
semaphore.Release();
}
});
}
}
I am educating myself on Parallel.Invoke, and parallel processing in general, for use in a current project. I need a push in the right direction to understand how you can dynamically/intelligently allocate more parallel 'threads' as required.
As an example. Say you are parsing large log files. This involves reading from file, some sort of parsing of the returned lines and finally writing to a database.
So to me this is a typical problem that can benefit from parallel processing.
As a simple first pass the following code implements this.
Parallel.Invoke(
()=> readFileLinesToBuffer(),
()=> parseFileLinesFromBuffer(),
()=> updateResultsToDatabase()
);
Behind the scenes
readFileLinesToBuffer() reads each line and stores it in a buffer.
parseFileLinesFromBuffer() comes along, consumes lines from that buffer, and, let's say, puts them on another buffer, so that updateResultsToDatabase() can come along and consume that buffer.
So the code shown assumes that each of the three steps uses the same amount of time/resources, but let's say parseFileLinesFromBuffer() is a long-running process, so instead of running just one of these methods you want to run two in parallel.
How can you have the code intelligently decide to do this based on any bottlenecks it might perceive?
Conceptually I can see how some approach of monitoring the buffer sizes might work, spawning a new 'thread' to consume the buffer at an increased rate, for example... but I figure this type of issue was considered when the TPL library was put together.
Some sample code would be great but I really just need a clue as to what concepts I should investigate next. It looks like maybe the System.Threading.Tasks.TaskScheduler holds the key?
Have you tried the Reactive Extensions?
http://msdn.microsoft.com/en-us/data/gg577609.aspx
Rx is a new technology from Microsoft; its focus, as stated on the official site:
The Reactive Extensions (Rx)... ...is a library to compose
asynchronous and event-based programs using observable collections and
LINQ-style query operators.
You can download it as a Nuget package
https://nuget.org/packages/Rx-Main/1.0.11226
Since I am currently learning Rx, I wanted to take this example and just write code for it. The code I ended up with is not actually executed in parallel, but it is completely asynchronous and guarantees the source lines are processed in order.
Perhaps this is not the best implementation, but like I said, I am learning Rx (making it thread-safe would be a good improvement).
This is a DTO that I am using to return data from the background threads
class MyItem
{
public string Line { get; set; }
public int CurrentThread { get; set; }
}
These are the basic methods doing the real work. I am simulating the processing time with a simple Thread.Sleep, and I am returning the thread used to execute each method via Thread.CurrentThread.ManagedThreadId. Note the timer of ProcessLine: at 4 seconds, it is the most time-consuming operation.
private IEnumerable<MyItem> ReadLinesFromFile(string fileName)
{
var source = from e in Enumerable.Range(1, 10)
let v = e.ToString()
select v;
foreach (var item in source)
{
Thread.Sleep(1000);
yield return new MyItem { CurrentThread = Thread.CurrentThread.ManagedThreadId, Line = item };
}
}
private MyItem UpdateResultToDatabase(string processedLine)
{
Thread.Sleep(700);
return new MyItem { Line = "s" + processedLine, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
private MyItem ProcessLine(string line)
{
Thread.Sleep(4000);
return new MyItem { Line = "p" + line, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
I am using the following method just to update the UI:
private void DisplayResults(MyItem myItem, Color color, string message)
{
this.listView1.Items.Add(
new ListViewItem(
new[]
{
message,
myItem.Line ,
myItem.CurrentThread.ToString(),
Thread.CurrentThread.ManagedThreadId.ToString()
}
)
{
ForeColor = color
}
);
}
And finally this is the method that calls the Rx API
private void PlayWithRx()
{
// we init the observable with the lines read from the file
var source = this.ReadLinesFromFile("some file").ToObservable(Scheduler.TaskPool);
source.ObserveOn(this).Subscribe(x =>
{
// for each line read, we update the UI
this.DisplayResults(x, Color.Red, "Read");
// for each line read, we subscribe the line to the ProcessLine method
var process = Observable.Start(() => this.ProcessLine(x.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(c =>
{
// for each line processed, we update the UI
this.DisplayResults(c, Color.Blue, "Processed");
// for each line processed we subscribe to the final process the UpdateResultToDatabase method
// finally, we update the UI when the line processed has been saved to the database
var persist = Observable.Start(() => this.UpdateResultToDatabase(c.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(z => this.DisplayResults(z, Color.Black, "Saved"));
});
});
}
This process runs totally in the background.
In an async/await world, you'd have something like:
public async Task ProcessFileAsync(string filename)
{
var lines = await ReadLinesFromFileAsync(filename);
var parsed = await ParseLinesAsync(lines);
await UpdateDatabaseAsync(parsed);
}
Then a caller could just do var tasks = filenames.Select(ProcessFileAsync).ToArray(); and whatever fits the context (WaitAll, WhenAll, etc.).
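For instance, a minimal sketch of that calling side (ProcessFileAsync as defined above; the wrapper method is just illustrative):
public async Task ProcessAllFilesAsync(IEnumerable<string> filenames)
{
    var tasks = filenames.Select(ProcessFileAsync).ToArray();
    await Task.WhenAll(tasks); // asynchronously wait for every file's pipeline
}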
Use a couple of BlockingCollections. Here is an example.
The idea is that you create a producer that puts data into the collection
while (true) {
var data = ReadData();
blockingCollection1.Add(data);
}
Then you create any number of consumers that reads from the collection
while (true) {
var data = blockingCollection1.Take();
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
}
and so on
You can also let the TPL handle the number of consumers by using Parallel.ForEach:
Parallel.ForEach(blockingCollection1.GetConsumingPartitioner(),
data => {
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
});
(Note that you need to use GetConsumingPartitioner, not GetConsumingEnumerable.)
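For completeness, here is a minimal self-contained sketch of the producer/consumer idea using the built-in GetConsumingEnumerable with explicit consumer tasks (Data, ReadAllData, and ProcessData are hypothetical placeholders in the spirit of the fragments above):
var collection = new BlockingCollection<Data>(boundedCapacity: 100);

// Producer: add items, then signal that no more will arrive.
var producer = Task.Factory.StartNew(() =>
{
    foreach (var data in ReadAllData())
    {
        collection.Add(data);
    }
    collection.CompleteAdding();
});

// Consumers: GetConsumingEnumerable blocks until items arrive and ends
// once CompleteAdding has been called and the collection is empty.
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Factory.StartNew(() =>
{
    foreach (var data in collection.GetConsumingEnumerable())
    {
        ProcessData(data);
    }
})).ToArray();

Task.WaitAll(consumers);
producer.Wait();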
Is there any chance that multiple BackgroundWorkers perform better than Tasks on 5-second-long processes? I remember reading in a book that a Task is designed for short-running processes.
The reason I ask is this:
I have a process that takes 5 seconds to complete, and there are 4000 processes to complete. At first I did:
for (int i=0; i<4000; i++) {
Task.Factory.StartNew(action);
}
and this had poor performance (after the first minute, 3-4 tasks were completed, and the console application had 35 threads). Maybe this was stupid, but I thought the thread pool would handle this kind of situation (it would put all the actions in a queue, and when a thread was free, it would take an action and execute it).
The second step was to manually create Environment.ProcessorCount background workers and place all the actions in a ConcurrentQueue. So the code would look something like this:
var workers = new List<BackgroundWorker>();
// initialize one worker per core
for (int i = 0; i < Environment.ProcessorCount; i++)
{
    workers.Add(new BackgroundWorker());
}
workers.ForEach((bk) =>
{
bk.DoWork += (s, e) =>
{
while (toDoActions.Count > 0)
{
Action a;
if (toDoActions.TryDequeue(out a))
{
a();
}
}
};
bk.RunWorkerAsync();
});
This performed way better. It performed much better than the tasks, even when I had 30 background workers (roughly as many threads as in the first case).
Later edit:
I start the Tasks like this:
public static Task IndexFile(string file)
{
Action<object> indexAction = new Action<object>((f) =>
{
Index((string)f);
});
return Task.Factory.StartNew(indexAction, file);
}
And the Index method is this one:
private static void Index(string file)
{
AudioDetectionServiceReference.AudioDetectionServiceClient client = new AudioDetectionServiceReference.AudioDetectionServiceClient();
client.IndexCompleted += (s, e) =>
{
if (e.Error != null)
{
if (FileError != null)
{
FileError(client,
new FileIndexErrorEventArgs((string)e.UserState, e.Error));
}
}
else
{
if (FileIndexed != null)
{
FileIndexed(client, new FileIndexedEventArgs((string)e.UserState));
}
}
};
using (IAudio proxy = new BassProxy())
{
List<int> max = new List<int>();
if (proxy.ReadFFTData(file, out max))
{
while (max.Count > 0 && max.First() == 0)
{
max.RemoveAt(0);
}
while (max.Count > 0 && max.Last() == 0)
{
max.RemoveAt(max.Count - 1);
}
client.IndexAsync(max.ToArray(), file, file);
}
else
{
throw new CouldNotIndexException(file, "The audio proxy did not return any data for this file.");
}
}
}
This method reads some data from an MP3 file, using the Bass.net library. That data is then sent to a WCF service, using the async method.
The IndexFile(string file) method, which creates the tasks, is called 4000 times in a for loop.
Those two events, FileIndexed and FileError, are not handled anywhere, so they are never raised.
The reason the performance of the Tasks was so poor was that you started too many small tasks (4000). Remember that the CPU needs to schedule the tasks as well, so starting lots of short-lived tasks creates extra workload for the CPU. More information can be found in the second paragraph of the TPL documentation:
Starting with the .NET Framework 4, the TPL is the preferred way to
write multithreaded and parallel code. However, not all code is
suitable for parallelization; for example, if a loop performs only a
small amount of work on each iteration, or it doesn't run for many
iterations, then the overhead of parallelization can cause the code to
run more slowly.
When you used the background workers, you limited the number of possible alive threads to the processor count, which reduced a lot of the scheduling overhead.
Given that you have a strictly defined list of things to do, I'd use the Parallel class (either For or ForEach, depending on what suits you better). Furthermore, you can pass a configuration parameter to either of these methods to control how many tasks are actually performed at the same time:
System.Threading.Tasks.Parallel.For(0, 20000, new ParallelOptions() { MaxDegreeOfParallelism = 5 }, i =>
{
//do something
});
The above code will perform 20000 operations, but will NOT perform more than 5 operations at the same time.
I SUSPECT the reason the background workers did better for you was because you had them created and instantiated at the start, while in your sample Task code it seems you're creating a new Task object for every operation.
Alternatively, did you think about using a fixed number of Task objects instantiated at the start and then performing a similar action with a ConcurrentQueue like you did with the background workers? That should also prove to be quite efficient.
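A hedged sketch of that alternative (Index is the question's method; 'files' stands in for the 4000 file paths):
var queue = new ConcurrentQueue<string>(files);
var workers = new Task[Environment.ProcessorCount];
for (int i = 0; i < workers.Length; i++)
{
    workers[i] = Task.Factory.StartNew(() =>
    {
        string file;
        // Each long-running worker keeps pulling files until the queue is empty.
        while (queue.TryDequeue(out file))
        {
            Index(file);
        }
    }, TaskCreationOptions.LongRunning);
}
Task.WaitAll(workers);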
Have you considered using the ThreadPool?
http://msdn.microsoft.com/en-us/library/system.threading.threadpool.aspx
If your performance is slower when using threads, it can only be due to threading overhead (allocating and destroying individual threads).
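A minimal sketch of queuing work to the ThreadPool (DoWork is a hypothetical stand-in for the per-item processing):
ThreadPool.QueueUserWorkItem(state => DoWork((string)state), "item1");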