TPL: Dispose processed items - c#

In C#, I am using Task Parallel Library (TPL) to download an image, process the image, and save the analysis results. A simplified code reads as the following.
var getImage = new TransformBlock<int, Image>(GetImage);
var proImage = new TransformBlock<Image, double>(ProcessImage);
var saveRes = new ActionBlock<double>(SaveResult);
var linkOptions = new DataflowLinkOptions() { PropagateCompletion = true };
getImage.LinkTo(proImage, linkOptions);
proImage.LinkTo(saveRes, linkOptions);
for (int x = 0; x < 1000000; x++)
    getImage.Post(x);
getImage.Complete();
saveRes.Completion.Wait();
This works as expected, except for memory usage. I expected int_x, image_x, and double_x to be disposed once the pipeline has processed that iteration. In other words, I expected every resource created during the execution of getImage, proImage, and saveRes for iteration x to be disposed when the last block completes its execution. However, this implementation keeps all the objects in memory until I exit the scope of TPL.
Am I missing something? Is this the expected behavior of TPL? And is there any option to set so the resources are released at the end of each iteration?
Update
Following the suggestion in the comments, I rewrote the code using BufferBlock and SendAsync as follows. However, I do not think it leads to reclaiming the resources consumed by each task. Setting BoundedCapacity only causes my program to halt at the point where, I believe, it has reached the BoundedCapacity limit.
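// ExecutionDataflowBlockOptions is required by TransformBlock and ActionBlock;
// BufferBlock accepts it too because it derives from DataflowBlockOptions
var blockOpts = new ExecutionDataflowBlockOptions()
{ BoundedCapacity = 100 };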
var imgBuffer = new BufferBlock<int>(blockOpts);
var getImage = new TransformBlock<int, Image>(GetImage, blockOpts);
var proImage = new TransformBlock<Image, double>(ProcessImage, blockOpts);
var SaveRes = new ActionBlock<double>(SaveResult, blockOpts);
var linkOptions = new DataflowLinkOptions() { PropagateCompletion = true };
imgBuffer.LinkTo(getImage, linkOptions);
getImage.LinkTo(proImage, linkOptions);
proImage.LinkTo(SaveRes, linkOptions);
for (int x = 0; x < 1000000; x++)
    await imgBuffer.SendAsync(x);
// complete the head of the chain so completion propagates through every link
imgBuffer.Complete();
SaveRes.Completion.Wait();

Is this the expected behavior of TPL?
Yes. It doesn't root all the objects (they are available for garbage collection and finalization), but it does not dispose them, either.
And is there any option to set so the resources are released at the end of each iteration?
No.
How can I make sure Dispose is automatically called once the last block/action has executed on an input?
To dispose objects, your code should call Dispose. This is fairly easily done by modifying ProcessImage or wrapping it in a delegate.
If ProcessImage is synchronous:
var proImage = new TransformBlock<Image, double>(image => { using (image) return ProcessImage(image); });
or if it's asynchronous:
var proImage = new TransformBlock<Image, double>(async image => { using (image) return await ProcessImage(image); });
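The wrapping approach above can be checked end to end with a small sketch. Everything named here is an assumption made for illustration: FakeImage is a hypothetical stand-in for the question's Image type that only counts disposals, and the processing step is a placeholder calculation rather than the real ProcessImage.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical stand-in for the question's Image type; it only counts disposals.
sealed class FakeImage : IDisposable
{
    public static int DisposedCount;
    public int Id { get; }
    public FakeImage(int id) => Id = id;
    public void Dispose() => Interlocked.Increment(ref DisposedCount);
}

static class DisposeInPipelineDemo
{
    // Pushes itemCount items through the pipeline and returns how many images were disposed.
    public static async Task<int> RunAsync(int itemCount)
    {
        Interlocked.Exchange(ref FakeImage.DisposedCount, 0);

        var getImage = new TransformBlock<int, FakeImage>(id => new FakeImage(id));
        var proImage = new TransformBlock<FakeImage, double>(image =>
        {
            // The block that consumes the image last is responsible for disposing it.
            using (image)
            {
                return image.Id * 0.5; // placeholder for the real analysis
            }
        });
        var saveRes = new ActionBlock<double>(_ => { });

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        getImage.LinkTo(proImage, linkOptions);
        proImage.LinkTo(saveRes, linkOptions);

        for (int x = 0; x < itemCount; x++)
            await getImage.SendAsync(x);
        getImage.Complete();
        await saveRes.Completion;

        return FakeImage.DisposedCount;
    }
}
```

Calling DisposeInPipelineDemo.RunAsync(10) should report ten disposals, one per image, because each image is disposed as soon as the processing delegate returns.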

Related

Report when input to first dataflow block finishes all linked blocks

I am using TPL Dataflow to download data from a ticketing system.
The system takes the ticket number as the input, calls an API and receives a nested JSON response with various information. Once received, a set of blocks handles each level of the nested structure and writes it to a relational database, for example Conversations, Conversation Attachments, Users, User Photos, and User Tags.
Json
{
  "conversations":[
    {
      "id":12345,
      "author_id":23456,
      "body":"First Conversation"
    },
    {
      "id":98765,
      "author_id":34567,
      "body":"Second Conversation",
      "attachments":[
        {
          "attachmentid":12345,
          "attachment_name":"Test.jpg"
        }
      ]
    }
  ],
  "users":[
    {
      "userid":12345,
      "user_name":"John Smith"
    },
    {
      "userid":34567,
      "user_name":"Joe Bloggs",
      "user_photo":
      {
        "photoid":44556,
        "photo_name":"headshot.jpg"
      },
      "tags":[
        "development",
        "vip"
      ]
    }
  ]
}
Code
Some blocks need to broadcast so that deeper nesting can still have access to the data. e.g. UserModelJson is broadcast so that 1 block can handle writing the user, 1 block can handle writing the User Tags and 1 block can handle writing the User Photos.
var loadTicketsBlock = new TransformBlock<int, ConversationsModelJson>(async ticketNumber => await p.GetConversationObjectFromTicket(ticketNumber));
var broadcastConversationsObjectBlock = new BroadcastBlock<ConversationsModelJson>(conversations => conversations);
//Conversation
var getConversationsFromConversationObjectBlock = new TransformManyBlock<ConversationsModelJson, ConversationModelJson>(conversation => ModelConverter.ConvertConversationsObjectJsonToConversationJson(conversation));
var convertConversationsBlock = new TransformBlock<ConversationModelJson, ConversationModel>(conversation => ModelConverter.ConvertConversationJsonToConversation(conversation));
var batchConversionBlock = new BatchBlock<ConversationModel>(batchBlockSize);
var convertConversationsToDTBlock = new TransformBlock<IEnumerable<ConversationModel>, DataTable>(conversations => ModelConverter.ConvertConversationModelToConversationDT(conversations));
var writeConversationsBlock = new ActionBlock<DataTable>(async conversations => await p.ProcessConversationsAsync(conversations));
var getUsersFromConversationsBlock = new TransformManyBlock<ConversationsModelJson, UserModelJson>(conversations => ModelConverter.ConvertConversationsJsonToUsersJson(conversations));
var broadcastUserBlock = new BroadcastBlock<UserModelJson>(userModelJson => userModelJson);
//User
var convertUsersBlock = new TransformBlock<UserModelJson, UserModel>(user => ModelConverter.ConvertUserJsonToUser(user));
var batchUsersBlock = new BatchBlock<UserModel>(batchBlockSize);
var convertUsersToDTBlock = new TransformBlock<IEnumerable<UserModel>, DataTable>(users => ModelConverter.ConvertUserModelToUserDT(users));
var writeUsersBlock = new ActionBlock<DataTable>(async users => await p.ProcessUsersAsync(users));
//UserTag
var getUserTagsFromUserBlock = new TransformBlock<UserModelJson, UserTagModel>(user => ModelConverter.ConvertUserJsonToUserTag(user));
var batchTagsBlock = new BatchBlock<UserTagModel>(batchBlockSize);
var convertTagsToDTBlock = new TransformBlock<IEnumerable<UserTagModel>, DataTable>(tags => ModelConverter.ConvertUserTagModelToUserTagDT(tags));
var writeTagsBlock = new ActionBlock<DataTable>(async tags => await p.ProcessUserTagsAsync(tags));
DataflowLinkOptions linkOptions = new DataflowLinkOptions()
{
PropagateCompletion = true
};
loadTicketsBlock.LinkTo(broadcastConversationsObjectBlock, linkOptions);
//Conversation
broadcastConversationsObjectBlock.LinkTo(getConversationsFromConversationObjectBlock, linkOptions);
getConversationsFromConversationObjectBlock.LinkTo(convertConversationsBlock, linkOptions);
convertConversationsBlock.LinkTo(batchConversionBlock, linkOptions);
batchConversionBlock.LinkTo(convertConversationsToDTBlock, linkOptions);
convertConversationsToDTBlock.LinkTo(writeConversationsBlock, linkOptions);
var tickets = await provider.GetAllTicketsAsync();
foreach (var ticket in tickets)
{
cts.Token.ThrowIfCancellationRequested();
await loadTicketsBlock.SendAsync(ticket.TicketID);
}
loadTicketsBlock.Complete();
The LinkTo blocks are repeated for each type of data to be written.
I know when the whole pipeline is complete by using
await Task.WhenAll(<Last block of each branch>.Completion);
However, if I pass ticket number 1 into the loadTicketsBlock block, how do I know when that specific ticket has been through all blocks in the pipeline and is therefore complete?
The reason that I want to know this is so that I can report to the UI that ticket 1 of 100 is complete.
You could consider using the TaskCompletionSource as the base class for all your sub-entities. For example:
class Attachment : TaskCompletionSource
{
}
class Conversation : TaskCompletionSource
{
}
Then every time you insert a sub-entity in the database, you mark it as completed:
attachment.SetResult();
...or if the insert fails, mark it as faulted:
attachment.SetException(ex);
Finally you can combine all the asynchronous completions in one, with the method Task.WhenAll:
Task ticketCompletion = Task.WhenAll(Enumerable.Empty<Task>()
.Append(ticket.Task)
.Concat(attachments.Select(e => e.Task))
.Concat(conversations.Select(e => e.Task)));
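A minimal runnable sketch of this idea follows. It assumes .NET 5 or later, where the non-generic TaskCompletionSource exists. The Ticket, Conversation, and Attachment classes are hypothetical, and the SetResult calls stand in for successful database inserts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical entity types; each one doubles as its own completion source.
class Ticket : TaskCompletionSource { }
class Conversation : TaskCompletionSource { }
class Attachment : TaskCompletionSource { }

static class TicketCompletionDemo
{
    // Returns true if the combined task completed only after every sub-entity was marked done.
    public static async Task<bool> RunAsync()
    {
        var ticket = new Ticket();
        var conversations = new List<Conversation> { new Conversation(), new Conversation() };
        var attachments = new List<Attachment> { new Attachment() };

        // One task that represents "everything belonging to this ticket has been written".
        Task ticketCompletion = Task.WhenAll(
            new[] { ticket.Task }
                .Concat(conversations.Select(c => c.Task))
                .Concat(attachments.Select(a => a.Task)));

        // Simulated inserts: in the real pipeline each write block would call SetResult.
        ticket.SetResult();
        foreach (var conversation in conversations)
            conversation.SetResult();

        // The attachment has not been written yet, so the ticket is not complete.
        bool completedTooEarly = ticketCompletion.IsCompleted;

        attachments[0].SetResult();
        await ticketCompletion;

        return !completedTooEarly && ticketCompletion.IsCompletedSuccessfully;
    }
}
```

The UI code can then await ticketCompletion for each ticket and report progress as each one finishes, independently of when the whole pipeline completes.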
When I track progress in Dataflow, I usually make the last block one that notifies the UI of progress. To track the progress of your inputs, you need to keep the context of the original input in all the objects you pass around. In this case, you need to be able to tell that you are working on ticket 1 all the way through the pipeline. If one of your transforms drops that context, you will need to rethink the object types passed through the pipeline so the context is preserved.
A simple example is laid out below: a broadcast block feeds three transform blocks, and all three transform blocks feed a single action block that reports on the progress of the pipelines.
When merging back into the single action block, make sure not to propagate completion at that point: as soon as one block propagates completion to the action block, the action block stops accepting input. Instead, wait for the last block of each pipeline to complete, and then manually complete the final notify-the-UI action block.
using System;
using System.Threading.Tasks.Dataflow;
using System.Threading.Tasks;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
var broadcastBlock = new BroadcastBlock<string>(x => x);
var transformBlockA = new TransformBlock<string, string>(x =>
{
return x + "A";
});
var transformBlockB = new TransformBlock<string, string>(x =>
{
return x + "B";
});
var transformBlockC = new TransformBlock<string, string>(x =>
{
return x + "C";
});
var ticketTracking = new Dictionary<int, List<string>>();
var notifyUiBlock = new ActionBlock<string>(x =>
{
var ticketNumber = int.Parse(x.Substring(5,1));
var taskLetter = x.Substring(7,1);
var success = ticketTracking.TryGetValue(ticketNumber, out var tasksComplete);
if (!success)
{
tasksComplete = new List<string>();
ticketTracking[ticketNumber] = tasksComplete;
}
tasksComplete.Add(taskLetter);
if (tasksComplete.Count == 3)
{
Console.WriteLine($"Ticket {ticketNumber} is complete");
}
});
DataflowLinkOptions linkOptions = new DataflowLinkOptions() {PropagateCompletion = true};
broadcastBlock.LinkTo(transformBlockA, linkOptions);
broadcastBlock.LinkTo(transformBlockB, linkOptions);
broadcastBlock.LinkTo(transformBlockC, linkOptions);
transformBlockA.LinkTo(notifyUiBlock);
transformBlockB.LinkTo(notifyUiBlock);
transformBlockC.LinkTo(notifyUiBlock);
for(var i = 0; i < 5; i++)
{
broadcastBlock.Post($"Task {i} ");
}
broadcastBlock.Complete();
Task.WhenAll(transformBlockA.Completion, transformBlockB.Completion, transformBlockC.Completion).Wait();
notifyUiBlock.Complete();
notifyUiBlock.Completion.Wait();
Console.WriteLine("Done");
}
}
This will give output similar to this:
Ticket 0 is complete
Ticket 1 is complete
Ticket 2 is complete
Ticket 3 is complete
Ticket 4 is complete
Done

Backpressure via BufferBlock not working. (C# TPL Dataflow)

Typical situation: Fast producer, slow consumer, need to slow producer down.
Sample code that doesn't work as I expected (explained below):
// I assumed this block will behave like BlockingCollection, but it doesn't
var bb = new BufferBlock<int>(new DataflowBlockOptions {
BoundedCapacity = 3, // looks like this does nothing
});
// fast producer
int dataSource = -1;
var producer = Task.Run(() => {
while (dataSource < 10) {
var message = ++dataSource;
bb.Post(message);
Console.WriteLine($"Posted: {message}");
}
Console.WriteLine("Calling .Complete() on buffer block");
bb.Complete();
});
// slow consumer
var ab = new ActionBlock<int>(i => {
Thread.Sleep(500);
Console.WriteLine($"Received: {i}");
}, new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 2,
});
bb.LinkTo(ab);
ab.Completion.Wait();
How I thought this code would work, but it doesn't:
BufferBlock bb is the blocking queue with capacity of 3. Once capacity reached, producer should not be able to .Post() to it, until there's a vacant slot.
Doesn't work like that. bb seems to happily accept any number of messages.
producer is a task that quickly Posts messages. Once all messages have been posted, the call to bb.Complete() should propagate through the pipeline and signal shutdown once all messages have been processed. Hence waiting ab.Completion.Wait(); at the end.
Doesn't work either. As soon as .Complete() is called, action block ab won't receive any more messages.
Can be done with a BlockingCollection, which I thought in TPL Dataflow (TDF) world BufferBlock was the equivalent of. I guess I'm misunderstanding how backpressure is supposed to work in TPL Dataflow.
So where's the catch? How to run this pipeline, not allowing more than 3 messages in the buffer bb, and wait for its completion?
PS: I found this gist (https://gist.github.com/mnadel/df2ec09fe7eae9ba8938) where it's suggested to maintain a semaphore to block writing to BufferBlock. I thought this was "built-in".
Updates after accepting the answer:
If you're looking at this question, you need to remember that ActionBlock also has its own input buffer.
That's one thing. You also need to realize that, because all blocks have their own input buffers, you don't need the BufferBlock for what its name might imply. A BufferBlock is more of a utility block for more complex architectures, or a load-balancing block. It's not a backpressure buffer.
Completion propagation needs to be defined explicitly at the link level.
When calling .LinkTo(), you need to explicitly pass new DataflowLinkOptions {PropagateCompletion = true} as the second argument.
To introduce backpressure, you need to use SendAsync when you send items into the block. This allows your producer to wait until the block is ready for the item. Something like this is what you're looking for:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 3
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(100);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
await block.SendAsync(i);
}
block.Complete();
await block.Completion;
}
}
If you change this to use Post and print the result of the Post you'll see that many items fail to be passed to the block:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 1
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(1000);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
var result = block.Post(i);
Console.WriteLine(result);
}
block.Complete();
await block.Completion;
}
}
Output:
True
False
False
False
False
False
False
False
False
False
0
With the guidance from JSteward's answer, I came up with the following code.
It produces (reads etc.) new items concurrently with processing said items, maintaining a read-ahead buffer.
The completion signal is sent to the head of the chain when the "producer" has no more items.
The program also awaits the completion of the whole chain before terminating.
static async Task Main() {
string Time() => $"{DateTime.Now:hh:mm:ss.fff}";
// the buffer is added to the chain just for demonstration purposes
// the chain would work fine using just the built-in input buffer
// of the `action` block.
var buffer = new BufferBlock<int>(new DataflowBlockOptions {BoundedCapacity = 3});
var action = new ActionBlock<int>(async i =>
{
Console.WriteLine($"[{Time()}]: Processing: {i}");
await Task.Delay(500);
}, new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 2, BoundedCapacity = 2});
// it's necessary to set `PropagateCompletion` property
buffer.LinkTo(action, new DataflowLinkOptions {PropagateCompletion = true});
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
Console.WriteLine($"[{Time()}]: Ready to send: {i}");
await buffer.SendAsync(i);
Console.WriteLine($"[{Time()}]: Sent: {i}");
}
// we call `.Complete()` on the head of the chain and it's propagated forward
buffer.Complete();
await action.Completion;
}

Do not wait for class instance to initialize and continue with the code c#

I'm quite new to programming, so the question might look stupid to you:
for (int i = 0; i < 20; i++)
{
driver[i] = new ChromeDriver(service, options[i]);
}
I'm trying to make a web bot. It has to open multiple Chrome windows (which works fine).
The part where it opens Chrome is on line 3. Each opening takes about 5-6 seconds.
So my question is, when I initialize a new instance of ChromeDriver, can I continue with the initialization of another ChromeDriver instance (or with other code) even if the first initialization is not done?
tldr; how to initialize multiple instances at the same time if the initialization takes time.
I appreciate any help
You can run multiple threads in C#, which means that you can run your code in parallel, without waiting for one method to complete before another runs.
The recommended way to do that is to use methods from the System.Threading.Tasks namespace, most notably Task.Run.
In your case, though, I would recommend considering Parallel.For:
Parallel.For(0, 20, i =>
{
driver[i] = new ChromeDriver(service, options[i]);
});
You can use Task.Run:
var task = Task.Run(() => new ChromeDriver(service, options[i]));
The overall code will look like this:
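var taskList = new List<Task>(20);
for (int i = 0; i < 20; i++)
{
    // copy the loop variable so each lambda captures its own index
    int index = i;
    var task = Task.Run(() => { driver[index] = new ChromeDriver(service, options[index]); });
    taskList.Add(task);
}
// at the end, wait for all tasks to finish
await Task.WhenAll(taskList);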

C# ActionBlock Equivalent that collects results

I'm currently using an ActionBlock to process serially started asynchronous jobs. It works very well for processing each item Posted to it, but there is no way to collect a list of the results from each job.
What can I use to collect the results of my jobs in a thread safe manner?
My code is currently something like this:
var actionBlock = new ActionBlock<int> (async i => await Process(i));
for(int i = 0; i < 100; i++)
{
actionBlock.Post(i);
}
actionBlock.Complete();
await actionBlock.Completion;
I've tried using a TransformBlock instead, but it hangs indefinitely when awaiting the Completion. The completion's status is "WaitingForActivation".
My code with the TransformBlock is something like this:
var transformBlock = new TransformBlock<int, string> (async i => await Process(i));
for(int i = 0; i < 100; i++)
{
    transformBlock.Post(i);
}
transformBlock.Complete();
await transformBlock.Completion;
transformBlock.TryReceiveAll(out IList<string> strings);
It turns out a ConcurrentBag is the answer:
var bag = new ConcurrentBag<string>();
var actionBlock = new ActionBlock<int> (async i =>
bag.Add(await Process(i))
);
for(int i = 0; i < 100; i++)
{
actionBlock.Post(i);
}
actionBlock.Complete();
await actionBlock.Completion;
Now 'bag' has all the results in it, and can be accessed as an IEnumerable.
The code I've actually ended up using uses a Parallel.ForEach instead of the ActionBlock.
Parallel.ForEach
(
    inputData,
    // Parallel.ForEach does not await async lambdas, so block on the task here
    i => bag.Add(Process(i).GetAwaiter().GetResult())
);
This is quite a lot simpler, but seems about as good for performance and still has options to limit the degree of parallelism etc.
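For comparison, the TransformBlock attempt from the question can also be made to work. A TransformBlock's Completion task does not finish until its output buffer has been emptied, so awaiting Completion before receiving anything waits forever. The sketch below drains the output first. The Process method here is a hypothetical stand-in for the question's method, and the class and method names are only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static class CollectResultsDemo
{
    // Hypothetical stand-in for the question's Process method.
    static async Task<string> Process(int i)
    {
        await Task.Yield();
        return $"item {i}";
    }

    public static async Task<List<string>> RunAsync(int count)
    {
        var transformBlock = new TransformBlock<int, string>(i => Process(i));

        for (int i = 0; i < count; i++)
            transformBlock.Post(i);
        transformBlock.Complete();

        // Drain the output buffer; Completion only finishes once it is empty.
        var results = new List<string>();
        while (await transformBlock.OutputAvailableAsync())
        {
            while (transformBlock.TryReceive(out var item))
                results.Add(item);
        }

        await transformBlock.Completion;
        return results;
    }
}
```

Because a TransformBlock preserves input order by default, the returned list is in the same order as the posted items, which a ConcurrentBag does not guarantee.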

c# - create thread in for loop (argument out of range exception)

I don't know how to describe this problem precisely. Let's look at my code.
for (int i = 0; i < myMT.Keys[key_indexer].Count; i++)
{
threads.Add(new Thread(
() =>
{
sounds[myMT.Keys[key_indexer][i]].PlayLooping();
}
));
threads[threads.Count - 1].Start();
}
Note: sounds is a list of SoundPlayers
The initialization of threads and myMT:
List<Thread> threads = null;
MusicTransfer myMT=null;
and in the constructor:
threads = new List<Thread>();
myMT = new MusicTransfer(bubblePanel);
The variable Keys in myMT is with type of List<List<int>>. It is initialized with the same way of myMT and threads. Imagine a matrix, the outer list is a list of rows and the inner one is for each cell.
When I run the program, I set myMT.Keys[key_indexer].Count to 1. So, normally, the for loop should stop when i reaches 1.
However, it throws an exception of ArgumentOutOfRange at the line of sounds[myMT.Keys[key_indexer][i]].PlayLooping(). So, I used debugger to check the value of each variable.
What I found are:
If I use "step over" to go step by step, so that a lot of time passes after the new thread starts, the for loop stops when i reaches 1, as it should.
If I click "continue" after the breakpoint is triggered, the for loop keeps processing after i equals 1.
The breakpoint must be set at the line threads.Add(new Thread(. If it is set at the line sounds[myMT.Keys[key_indexer][i]].PlayLooping();, the exception is thrown even with "step over".
I guess the problem is about thread, but have no idea how to solve it.
Thanks for any help!
There are so many things wrong with your post; however, maybe this will help you out a bit.
Note: Make your code readable; trust me, it does wonders.
// List of threads
var threads = new List<Thread>();
// Lets stop indexing everything and make it easy for ourselves
var someList = myMT.Keys[key_indexer];
for (var i = 0; i < someList.Count; i++)
{
// we need to copy the indexed sound into a local variable
// inside the loop, otherwise there is no guarantee
// the thread will see the right index when it runs
// (thank me later)
var someSound = sounds[someList[i]];
// create a thread and your callback
var thread = new Thread(() => someSound.PlayLooping());
// add thread to the list
threads.Add(thread);
}
// now let's start the threads in a nice orderly fashion
foreach (var thread in threads)
{
thread.Start();
}
Another way to do this with Tasks
var tasks = new List<Task>();
var someList = myMT.Keys[key_indexer];
for (var i = 0; i < someList.Count; i++)
{
var someSound = sounds[someList[i]];
var task = new Task(() => someSound.PlayLooping());
tasks.Add(task);
task.Start();
}
Task.WaitAll(tasks.ToArray());
Disclaimer: I take no responsibility for your other logic problems; this was for pure morbid academic purposes.
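The reason the local copy matters can be shown without threads. In a C# for loop, every lambda captures the same variable i rather than its value at that moment. A thread that starts after the loop has already incremented i therefore reads an index that is one past the end of the list. The sketch below uses hypothetical names and reproduces the effect with plain delegates that run after the loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LoopCaptureDemo
{
    // Every lambda shares the single loop variable, so all of them see its final value.
    public static List<int> CaptureLoopVariable()
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
            readers.Add(() => i);
        return readers.Select(read => read()).ToList(); // 3, 3, 3
    }

    // A local declared inside the loop body is a fresh variable on each iteration.
    public static List<int> CaptureLocalCopy()
    {
        var readers = new List<Func<int>>();
        for (int i = 0; i < 3; i++)
        {
            int copy = i;
            readers.Add(() => copy);
        }
        return readers.Select(read => read()).ToList(); // 0, 1, 2
    }
}
```

This also explains the debugger observations in the question: stepping slowly gives the thread time to read i before the loop increments it, while continuing at full speed usually does not.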
