DataflowBlock stops updating UI but still runs the action? - C#

This issue is really hard to debug: it doesn't always happen (and doesn't happen quickly, so I can't just step through the code), and it looks like no one out there has had a similar issue (I've googled for hours without finding anything related to it).
In short, my dataflow network works fine for a while, until I notice that the terminal block (which updates the UI) seems to stop working (no new data shows up on the UI) while all the upstream dataflow blocks are still working fine. So it's as if there is some disconnection between the other blocks and the UI block.
Here is my dataflow network in detail; take a look first, and then I'll explain more about the issue:
//the network graph first
[raw data block] -> [switching block] -> [data counting block]
                                      -> [processing block] -> [ok result block] -> [completion monitoring]
                                                            -> [not ok result block] -> [completion monitoring]
//in the UI code behind where I can consume the network and plug-in some other blocks for updating
//like this:
[ok result block] -> [ok result counting block]
[not ok result block] -> [other ui updating]
The [ok result block] is a BroadcastBlock which pushes results to the [ok result counting block]. Part of the issue I've described is that this [ok result counting block] seems to get disconnected from the [ok result block].
var options = new DataflowBlockOptions { EnsureOrdered = false };
var execOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 80 };
//[raw data block]
var rawDataBlock = new BufferBlock<Input>(options);
//[switching block]
var switchingBlock = new TransformManyBlock<Input,Input>(e => new[] {e,null});
//[data counting block]
var dataCountingBlock = new BroadcastBlock<Input>(null);
//[processing block]
var processingBlock = new TransformBlock<Input,int>(async e => {
    //call another api to compute the result
    var result = await …;
    //roll back the input for later processing (a kind of retry)
    if(result < 0){
        //per my logging, there is only one call dropping
        //in this case
        Task.Run(rollback);
    }
    //local function to roll back
    async Task rollback(){
        await rawDataBlock.SendAsync(e).ConfigureAwait(false);
    }
    return result;
}, execOptions);
//[ok result block]
var okResultBlock = new BroadcastBlock<int>(null, options);
//[not ok result block]
var notOkResultBlock = new BroadcastBlock<int>(null, options);
//[completion monitoring]
var completionMonitoringBlock = new ActionBlock<int>(e => {
    if(rawDataBlock.Completion.IsCompleted && processingBlock.InputCount == 0){
        processingBlock.Complete();
    }
}, execOptions);
//connect the blocks to build the network
rawDataBlock.LinkTo(switchingBlock);
switchingBlock.LinkTo(processingBlock, e => e != null);
switchingBlock.LinkTo(dataCountingBlock, e => e == null);
processingBlock.LinkTo(okResultBlock, e => e >= 9);
processingBlock.LinkTo(notOkResultBlock, e => e < 9);
okResultBlock.LinkTo(completionMonitoringBlock);
notOkResultBlock.LinkTo(completionMonitoringBlock);
In the UI code behind, I plug in some other UI blocks to update the info. Here I'm using WPF but I think it does not matter here:
var uiBlockOptions = new ExecutionDataflowBlockOptions {
    TaskScheduler = TaskScheduler.FromCurrentSynchronizationContext()
};
dataCountingBlock.LinkTo(new ActionBlock<Input>(e => {
    //these are properties in the VM class, which is bound to the UI (xaml view)
    RawInputCount++;
}, uiBlockOptions));
okResultBlock.LinkTo(new ActionBlock<int>(e => {
    //these are properties in the VM class, which is bound to the UI (xaml view)
    ProcessedCount++;
    OkResultCount++;
}, uiBlockOptions));
notOkResultBlock.LinkTo(new ActionBlock<int>(e => {
    //these are properties in the VM class, which is bound to the UI (xaml view)
    ProcessedCount++;
    PendingCount = processingBlock.InputCount;
}, uiBlockOptions));
I do have code monitoring the completion status of the blocks: rawDataBlock, processingBlock, okResultBlock, notOkResultBlock.
I also have other logging code inside the processingBlock to help with diagnosis.
So, as I said, after a fairly long time (about 1 hour, with about 600K items processed; actually this number says nothing about the issue, it could be random), the network still seems to run fine, except that some counts (ok result, not ok result) are no longer updated, as if the okResultBlock and notOkResultBlock had been disconnected from the processingBlock, OR as if they had been disconnected from the UI blocks (which update the UI). I've verified that the processingBlock is still working (no exception logged, and the results are still written to file), that the dataCountingBlock is still working well (new counts show up on the UI), and that none of processingBlock, okResultBlock, notOkResultBlock have completed (each block's Completion is .ContinueWith'd to a task that logs out the status, and nothing was logged).
So it's really stuck there, and I don't have any clue why it stops working like that. This kind of thing can happen when using a black-box library like TPL Dataflow. I know it may be hard for you to diagnose and reason about the possibilities; I'm just asking for suggestions on how to troubleshoot this, any shared experience with similar issues, and possibly some guesses about what could cause this kind of behavior in TPL Dataflow.
UPDATE:
I've successfully reproduced the bug one more time, and this time I had prepared some code to write down info to help with debugging. The issue now comes down to this: the processingBlock somehow does not actually push/post/send any message to any of its linked blocks (including the okResultBlock and notOkResultBlock), AND even a new block linked to it (prepended via DataflowLinkOptions with Append set to false) does not receive any message (result). As I said, the processingBlock itself still seems to work fine (its delegate runs the code inside and produces result logging normally). So this is still a very strange issue.
In short, the problem now becomes: why can the processingBlock not send/post its messages to the other linked blocks? Is there any possible cause for that to occur? And how can I tell whether blocks are linked successfully (after the call to .LinkTo)?
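One diagnostic I could add is to poll the block's own counters (just a sketch; TransformBlock exposes both InputCount and OutputCount, and a steadily growing OutputCount would mean results are being produced but never accepted by any linked target):
//needs using System.Threading;
var probe = new Timer(_ =>
    Console.WriteLine($"processingBlock: in={processingBlock.InputCount} out={processingBlock.OutputCount}"),
    null, TimeSpan.Zero, TimeSpan.FromSeconds(5));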

It's actually my fault: the processingBlock is indeed blocked, but it's blocked correctly, by design.
The processingBlock is blocked by 2 factors:
EnsureOrdered is true (the default), so the outputs are always queued in processing order.
There is at least one output result that cannot be pushed out (no linked block will accept it).
So if one output result cannot be pushed out, it becomes a blocking item, precisely because all output results are queued in processing order: every result processed after it simply queues up (is blocked) behind that first result which cannot be pushed out.
In my case, the special output result that cannot be pushed out is a null result. That null result is only produced on certain errors (exception handling). I have the 2 blocks okResultBlock and notOkResultBlock linked to the processingBlock, but both links are filtered to let only non-null results through. Sorry that my question does not reflect my exact code regarding the output type: in the question it is just a simple int, but it is actually a (nullable) class, so the actual linking code looks like this:
processingBlock.LinkTo(okResultBlock, e => e != null && e.Point >= 9);
processingBlock.LinkTo(notOkResultBlock, e => e != null && e.Point < 9);
So the null output result gets stuck and consequently blocks every result processed after it (because EnsureOrdered is true by default).
To fix this, I simply set EnsureOrdered to false (not strictly required to avoid the blocking, but good in my case) and, most importantly, added one more link to consume the null output results:
processingBlock.LinkTo(DataflowBlock.NullTarget<Output>(), e => e == null);
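Putting it together, the relinking looks roughly like this (just a sketch; Output stands for my actual nullable result class, and recreating the block with EnsureOrdered = false is optional for the fix):
var execOptions = new ExecutionDataflowBlockOptions {
    MaxDegreeOfParallelism = 80,
    EnsureOrdered = false //optional: avoids head-of-line queueing altogether
};
processingBlock.LinkTo(okResultBlock, e => e != null && e.Point >= 9);
processingBlock.LinkTo(notOkResultBlock, e => e != null && e.Point < 9);
//the essential part: null results need somewhere to go, or they block the output queue
processingBlock.LinkTo(DataflowBlock.NullTarget<Output>(), e => e == null);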

Related

SqlBulkCopy.WriteToServerAsync() does not write to target SQL Server table, bulkCopy.WriteToServer() does

Just as the title states. I am trying to load a ~8.45GB csv file with ~330 columns (~7.5 million rows) into a SQL Server instance, but I'm doing the parsing internally, as the file has some quirks to it (comma delimiters and quotes, etc). The heavy-duty bulk insert and line parsing is done as below:
var dataTable = new DataTable(TargetTable);
using var streamReader = new StreamReader(FilePath);
using var bulkCopy = new SqlBulkCopy(this._connection, SqlBulkCopyOptions.TableLock, null)
{
    DestinationTableName = TargetTable,
    BulkCopyTimeout = 0,
    BatchSize = BatchSize,
};
/// ...
var outputFields = new string[columnsInCsv];
this._connection.Open();
while ((line = streamReader.ReadLine()) != null)
{
    //get data
    CsvTools.ParseCsvLineWriteDirect(line, ref outputFields);
    // insert into datatable
    dataTable.LoadDataRow(outputFields, true);
    // update counters
    totalRows++;
    rowCounter++;
    if (rowCounter >= BatchSize)
    {
        try
        {
            // load data
            bulkCopy.WriteToServer(dataTable); // this works.
            //Task.Run(async () => await bulkCopy.WriteToServerAsync(dataTable)); // this does not.
            //bulkCopy.WriteToServerAsync(dataTable); // this does not write to the table either.
            rowCounter = 0;
            dataTable.Clear();
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex.ToString());
            return;
        }
    }
}
// check if we have any remnants to load
if (dataTable.Rows.Count > 0)
{
    bulkCopy.WriteToServer(dataTable); // same here as above
    //Task.Run(async () => await bulkCopy.WriteToServerAsync(dataTable));
    //bulkCopy.WriteToServerAsync(dataTable);
    dataTable.Clear();
}
this._connection.Close();
Obviously I would like this to be as fast as possible. I noticed via profiling that the WriteToServerAsync method was almost 2x as fast (in its execution duration) as the WriteToServer method, but when I use the async version, no data appears to be written to the target table (whereas the non-async version commits the data fine, just much more slowly). I'm assuming there is something I forgot (to somehow trigger the commit to the table), but I am not sure what could prevent the data from being committed to the target table.
Note that I am aware that SQL Server has a BULK INSERT statement but I need more control over the data for other reasons and would prefer to do this in C#. Also perhaps relevant is that I am using SQL Server 2022 Developer edition.
Fire and forget tasks
Performing Task.Run(...) or calling DoSomethingAsync() without a corresponding await essentially makes the task a fire-and-forget task. The "fire" refers to the creation of the task, and the "forget" to the fact that the coder appears not to be interested in any return value (if applicable) and doesn't desire any knowledge as to when the task completes.
Though not immediately problematic, it is if the calling thread or process exits before the task completes: the task will be terminated before completion! This problem typically occurs in short-lived processes such as console apps, and not so much in, say, Windows Services or web apps with 20-minute App Domain timeouts and the like.
Example
sending an asynchronous keep-alive/heartbeat to a remote service or monitor.
there is no return value to monitor, asynchronous or otherwise
It won't matter if it fails, as a more up-to-date call will eventually replace it.
It won't matter if it doesn't complete in time because the hosting process exits for some reason (after all, we are a heartbeat; if the process ends naturally, there is no heart to beat).
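For example, a deliberately fire-and-forget heartbeat might look like this (a sketch; the URL and helper method are hypothetical):
//needs using System.Net.Http; and using System.Threading.Tasks;
//intentionally unawaited: we care about neither the result nor the completion time;
//the discard documents the intent and silences the "call is not awaited" warning
_ = SendHeartbeatAsync("https://monitor.example/ping");

static async Task SendHeartbeatAsync(string url)
{
    using var client = new HttpClient();
    try { await client.GetAsync(url); } //best-effort
    catch { /* ignored by design: a newer beat will replace this one */ }
}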
Awaited tasks
Consider prefixing the call with an await, as in await bulkCopy.WriteToServerAsync(...);. This way the task is linked to the parent task/thread, which ensures the parent (unless it is terminated by some other means) does not exit before the task completes.
Naturally, the containing method and those up the call stack will need to be marked async and have await prefixes on the corresponding calls. This "async all the way" creates a nice daisy chain of linked tasks all the way up to the parent (or at least to the last method in the call chain with an await or a legacy ContinueWith()).
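Applied to the code in the question, a minimal sketch of "async all the way" might look like this (assuming the reader, table, counters, and bulk copy from the question are in scope, e.g. as fields; LoadFileAsync is a hypothetical name):
private async Task LoadFileAsync()
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        CsvTools.ParseCsvLineWriteDirect(line, ref outputFields);
        dataTable.LoadDataRow(outputFields, true);
        totalRows++;
        rowCounter++;
        if (rowCounter >= BatchSize)
        {
            //awaited, so each batch is committed before the next one is built
            await bulkCopy.WriteToServerAsync(dataTable);
            rowCounter = 0;
            dataTable.Clear();
        }
    }
    if (dataTable.Rows.Count > 0)
    {
        await bulkCopy.WriteToServerAsync(dataTable);
    }
}
The caller then awaits LoadFileAsync() in turn, all the way up the chain.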

C#: make a specific line time out after x seconds

I have a line in C# which does not work very reliably, does not time out at all, and runs forever.
To be more precise, I am trying to check the connection to a proxy using WebClient.DownloadString.
I want it to time out after 5 seconds, without making the whole method asynchronous,
so the code should be like this:
bool success = false
do_this_for_maximum_5_seconds_or_until_we_reach_the_end
{
WebClient.DownloadString("testurl");
success = true;
}
It will try to download testurl, and after the download finishes it will set success to true. If DownloadString takes more than 5 seconds, the call is canceled and we never reach the line where we set success to true, so it remains false and I know that it failed.
The thread should remain blocked while we try to DownloadString, so the action does not run in parallel. The ONLY difference from a normal line would be that we impose a 5-second timeout.
Please do not suggest alternatives such as using HttpClient, because I need similar code in other places too; I simply want code that will run in a synchronous application (I have not learned anything about asynchronous programming, so I would like to avoid it completely).
My approach was the one suggested by Andrew Arnott in this thread:
Asynchronously wait for Task<T> to complete with timeout
However, my issue is that I am not exactly sure what type "SomeOperationAsync()" is in his example (it seems to be a task, but how can I put actions into the task?), and the bigger issue is that VS wants to make the whole method asynchronous, while I want to run everything synchronously, just with a timeout for one specific line of code.
In case the question has been answered somewhere, kindly provide a link.
Thank you for any help!!
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
var downloadString =
    Observable
        .Using(() => new WebClient(), wc => Observable.Start(() => wc.DownloadString("testurl")))
        .Select(x => new { success = true, result = x });

var timeout =
    Observable
        .Timer(TimeSpan.FromSeconds(5.0))
        .Select(x => new { success = false, result = (string)null });

var operation = Observable.Amb(downloadString, timeout);

var output = await operation;

if (output.success)
{
    Console.WriteLine(output.result);
}
The first observable downloads your string. The second sets up a timeout. The third uses the Amb operator to take the result from whichever of the two input observables completes first.
We can then await the third observable to get its value, after which it's a simple task to check which result you got.
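If the call site really must stay synchronous, a rough sketch is to block on the same operation observable instead of awaiting it:
//Wait() (from System.Reactive.Linq) blocks the calling thread until the winning
//observable produces its single value - either the downloaded string or the timeout marker
var output = operation.Wait();
bool success = output.success; //false means the 5-second timeout won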

Durable Function Fan out with Time limit - remains in "running" status

I am attempting to implement a timeout for my Durable function.
In my function, I am currently fanning out to activity functions, each of which calls a separate API to collect current pricing data (price comparison site). All of this works well and I am happy with the results; however, I need to implement a timeout in case 1 or more APIs do not respond within a reasonable time (~15 seconds).
I am using the following pattern:
var parallelActivities = new List<Task<T>>
{
    context.CallActivityAsync<T>( "CallApi1", input ),
    context.CallActivityAsync<T>( "CallApi2", input ),
    context.CallActivityAsync<T>( "CallApi3", input ),
    context.CallActivityAsync<T>( "CallApi4", input ),
    context.CallActivityAsync<T>( "CallApi5", input ),
    context.CallActivityAsync<T>( "CallApi16", input )
};
var timeout = TimeSpan.FromSeconds(15);
var deadline = context.CurrentUtcDateTime.Add(timeout);
using ( var cts = new CancellationTokenSource() )
{
    var timeoutTask = context.CreateTimer(deadline, cts.Token);
    var taskRaceWinner = await Task.WhenAny(Task.WhenAll( parallelActivities ), timeoutTask);
    if ( taskRaceWinner != timeoutTask )
    {
        cts.Cancel();
    }
    foreach ( var completedParallelActivity in parallelActivities.Where( task => task.Status == TaskStatus.RanToCompletion ) )
    {
        //Process results here
    }
    //More logic here
}
Everything seems to work correctly. If any activity doesn't return within the time limit, the timeout task wins, and the data is processed and returned correctly.
The Durable Functions documentation states: "This mechanism does not actually terminate in-progress activity function execution. Rather, it simply allows the orchestrator function to ignore the result and move on. For more information, see the Timers documentation."
Unfortunately my function remains in the "running" status until it ultimately hits the durable function timeout and recycles.
Am I doing something wrong? I realize that, generally, a durable function is marked as running until all activities have completed, but the documentation above indicates that I should be able to "ignore" the activities that run too long.
I could implement a timeout in each individual API, but that doesn't seem like good design and I have been resisting it. So, please help me, Stack Overflow!
According to this, the Durable Task Framework will not change an orchestration's status to "completed" until all outstanding tasks are completed or canceled, even though their output is ignored. Also, according to this and this, we can't cancel an activity/sub-orchestration from the parent at the moment. So currently the only way I can think of is to pass a Timeout param (of TimeSpan type) from the parent as part of the input object to the activity (e.g. context.CallActivityAsync<T>( "CallApi1", input )) and let the child activity function handle its own exit, respecting that timeout. I tested this myself and it works fine. Please feel free to reach out to me for any follow-up.
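A rough sketch of that idea (the type and method names here are illustrative, not part of the Durable Functions API):
//needs using System; using System.Threading; using System.Threading.Tasks;
//input object carrying the timeout from the orchestrator
public class ApiCallInput
{
    public string Query { get; set; }
    public TimeSpan Timeout { get; set; }
}

//activity body: it races the API call against its own timeout and always returns,
//so the orchestration's outstanding-task count can reach zero
public static async Task<string> CallApi1(ApiCallInput input)
{
    using var cts = new CancellationTokenSource(input.Timeout);
    try
    {
        return await GetPricesAsync(input.Query, cts.Token);
    }
    catch (OperationCanceledException)
    {
        return null; //timed out; the orchestrator treats null as "no data"
    }
}

//stub standing in for the real API call; assumed to honor the cancellation token
private static async Task<string> GetPricesAsync(string query, CancellationToken ct)
{
    await Task.Delay(TimeSpan.FromSeconds(1), ct);
    return "prices for " + query;
}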

TPL DataFlow confusion around pipelines - should I create a new pipeline for each data call? How can I track data that's flowing through?

I'm struggling with how to apply TPL DataFlow to my application.
I've got a bunch of parallel data operations I want to track and manage, previously I was just using Tasks, but I'm trying to implement DataFlow to give me more control.
I'm composing a pipeline of tasks to say get the data and process it, here's an example of a pipeline to get data, process data, and log it as complete:
TransformBlock<string, string> loadDataFromFile = new TransformBlock<string, string>(filename =>
{
    // read the data file (takes a long time!)
    Console.WriteLine("Loading from " + filename);
    Thread.Sleep(2000);
    // return our result, for now just use the filename
    return filename + "_data";
});
TransformBlock<string, string> processData = new TransformBlock<string, string>(data =>
{
    // process the data
    Console.WriteLine("Processing data " + data);
    Thread.Sleep(2000);
    // return our result, for now just use the data string
    return data + "_processed";
});
TransformBlock<string, string> logProcessComplete = new TransformBlock<string, string>(data =>
{
    // Doesn't do anything to the data, just performs an 'action' (but still passes the data along, unlike ActionBlock)
    Console.WriteLine("Result " + data + " complete");
    return data;
});
I'm linking them together like this:
// create a pipeline
loadDataFromFile.LinkTo(processData);
processData.LinkTo(logProcessComplete);
I've been trying to follow this tutorial.
My confusion is that in the tutorial this pipeline seems to be a 'fire once' operation: it creates the pipeline, fires it off once, and it completes. This seems counter to how the Dataflow library is designed to be used. I've read:
The usual way of using TPL Dataflow is to create all the blocks, link
them together, and then start putting data in one end.
From "Concurrency in C# Cookbook" by Stephen Cleary.
But I'm not sure how to track the data after I've put said data 'in one end'. I need to be able to get the processed data from multiple parts of the program. Say the user presses two buttons, one to get the data from "File1" and do something with it, and one to get the data from "File2"; I'd need something like this, I think:
public async Task loadFile1ButtonPress()
{
    loadDataFromFile.Post("File1");
    var data = await logProcessComplete.ReceiveAsync();
    Console.WriteLine($"Got data1: {data}");
}
public async Task loadFile2ButtonPress()
{
    loadDataFromFile.Post("File2");
    var data = await logProcessComplete.ReceiveAsync();
    Console.WriteLine($"Got data2: {data}");
}
If these are performed 'synchronously' it works just fine, as there's only one piece of information flowing through the pipeline:
Console.WriteLine("waiting for File 1");
await loadFile1ButtonPress();
Console.WriteLine("waiting for File 2");
await loadFile2ButtonPress();
Console.WriteLine("Done");
Produces the expected output:
waiting for File 1
Loading from File1
Processing data File1_data
Result File1_data_processed complete
Got data1: File1_data_processed
waiting for File 2
Loading from File2
Processing data File2_data
Result File2_data_processed complete
Got data2: File2_data_processed
Done
This makes sense to me; it's just doing them one at a time.
However, the point is that I want to run these operations in parallel and asynchronously. If I simulate this (say, the user pressing both 'buttons' in quick succession) with:
Console.WriteLine("waiting");
await Task.WhenAll(loadFile1ButtonPress(), loadFile2ButtonPress());
Console.WriteLine("Done");
Does this work if the second operation takes longer than the first? I was expecting both to return the first data, however. (Originally this didn't work, but that was a bug I've since fixed - it does return the correct items now.)
I was thinking I could link an ActionBlock<string> to perform the action with the data, something like:
public async Task loadFile1ButtonPress()
{
    loadDataFromFile.Post("File1");
    // instead of var data = await logProcessComplete.ReceiveAsync();
    logProcessComplete.LinkTo(new ActionBlock<string>(data =>
    {
        Console.WriteLine($"Got data1: {data}");
    }));
}
But this changes the pipeline completely; now loadFile2ButtonPress won't work at all, as it's using that same pipeline.
Can I create multiple pipelines with the same blocks? Or should I be creating a whole new pipeline (and new blocks) for each 'operation'? (That seems to defeat the point of using the Dataflow library at all.)
Not sure if the best place for this is Stack Overflow or somewhere like Code Review; it might be a bit subjective.
If you need some events to happen after some data has been processed, you should convert your last block with AsObservable, and add some small code with Rx.Net:
var observable = logProcessComplete.AsObservable();
var subscription = observable.Subscribe(i => Console.WriteLine(i));
As mentioned in the comments, you can link your blocks to more than one block, with a predicate. Note that in that case a message will be delivered only to the first matching block. You also may create a BroadcastBlock, which delivers a copy of the message to each linked block.
Make sure that messages unwanted by every other block are linked to NullTarget, as otherwise they will stay in your pipeline forever and will stall your completion.
Check that your pipeline handles completion correctly: completion is only propagated along links created with DataflowLinkOptions { PropagateCompletion = true }, so with multiple links each one needs that option for Complete() to flow through.
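For these last two points, a minimal sketch (reusing the block names from the question; the filter predicate is illustrative):
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
//filtered link: only matching messages go to processData
loadDataFromFile.LinkTo(processData, linkOptions, data => data.StartsWith("File"));
//fallback link: anything unmatched is swallowed so it can't clog the pipeline
loadDataFromFile.LinkTo(DataflowBlock.NullTarget<string>());
processData.LinkTo(logProcessComplete, linkOptions);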

Cancelling and re-executing ReactiveCommand

I'm struggling with a ReactiveUI use case that feels so simple there must be "out-of-the-box" support for it, but I cannot find it.
The scenario is a basic search interface with these features:
A search string TextBox where the user enters the search text
A result TextBox where the result is presented
An indicator showing that a search is in progress
The search should work like this:
The search string TextBox is throttled, so that after 500ms of
inactivity, a search operation is initiated.
Each time a new search is initiated any ongoing search operation should be cancelled.
Basically I'm trying to extend the "Compelling example" to cancel the currently executing command before starting a new command.
Seems easy enough? Yeah, but I cannot get it right using ReactiveCommand. This is what I have:
var searchTrigger = this.WhenAnyValue(vm => vm.SearchString)
    .Throttle(TimeSpan.FromMilliseconds(500))
    .Publish().RefCount();

var searchCmd = ReactiveCommand.CreateFromObservable(
    () => Observable
        .StartAsync(ct => CancellableSearch(SearchString, ct))
        .TakeUntil(searchTrigger));

searchCmd.ToPropertyEx(this, vm => vm.Result);
searchCmd.IsExecuting.ToPropertyEx(this, vm => vm.IsSearching);
searchTrigger.Subscribe(_ => searchCmd.Execute(Unit.Default).Subscribe());
The above code works in all respects except searchCmd.IsExecuting. I kick off a new search regardless of the state of searchCmd.CanExecute, which makes IsExecuting unreliable, since it assumes serial execution of commands. And I cannot use InvokeCommand instead of Execute, because then new searches would not be started while a search is in progress.
I currently have a working solution without ReactiveCommand, but I have a strong feeling that this simple use case should be supported in a straightforward way using ReactiveCommand. What am I missing?
AFAICT Rx7 doesn't really handle this kind of overlapping execution. All the messages will eventually make it through, but not in a way that will keep your IsExecuting consistently true. Rx6 used an in-flight counter, so overlapping executions were handled, but Rx7 simplified it all way down, most likely for performance and reliability (though I'm just guessing).
Because tasks aren't going to cancel right away, the first command completes after the second command starts, which leads to IsExecuting toggling from true to false to true to false. That middle transition from false to true to false happens instantly as the messages catch up.
I know you said you have a non-ReactiveCommand version working, but here's a version that I think works with ReactiveCommand, by waiting for the first command to finish or finish cancelling. One advantage of waiting until the Task actually cancels is that you are assured you don't have two hands in the cookie jar :-) That might not matter in your case, but it can be nice in some cases.
//Fires an event right away so the search is cancelled faster
var searchEntered = this.WhenAnyValue(vm => vm.SearchString)
    .Where(x => !String.IsNullOrWhiteSpace(x))
    .Publish()
    .RefCount();

ReactiveCommand<string, string> searchCmd = ReactiveCommand.CreateFromObservable<string, string>(
    (searchString) => Observable.StartAsync(ct => CancellableSearch(searchString, ct))
        .TakeUntil(searchEntered));

//if triggered, wait for IsExecuting to transition back to false before firing the command again
var searchTrigger =
    searchEntered
        .Throttle(TimeSpan.FromMilliseconds(500))
        .Select(searchString => searchCmd.IsExecuting.Where(e => !e).Take(1).Select(_ => searchString))
        .Publish()
        .RefCount();

_IsSearching =
    searchCmd.IsExecuting
        .ToProperty(this, vm => vm.IsSearching);

searchTrigger
    .Switch()
    .InvokeCommand(searchCmd);
