I have a TPL Dataflow pipeline that worked fine using a transform block followed by an action block.
I've added a new action block to be executed simultaneously with the existing action block, but my new action block is never getting hit. No errors or exceptions are thrown.
Is there a step that I need to add to my code?
var ListDocId = new ConcurrentBag<string>(ConvertDataSetToList(IdDocDataSet));
if (ListDocId.Any())
{
var num_thread = GetThreadNumber();
//Initialize the pipeline of actions
var downloadBlock = new TransformBlock<string, RequestObject>(docId =>
new RequestObject
{
DownloadedFile = ListDownload(docId),
IdDoc = docId
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 }
);
var uploadInS3Block = new ActionBlock<RequestObject>(request =>
    UploadFileAsync(request.DownloadedFile, request.IdDoc),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 }
);
var InsertdocIdIntoDbBlock = new ActionBlock<RequestObject>(s3Request =>
InsertIntoDataBase(s3Request.IdDoc, InsertDate),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 }
);
var options = new DataflowLinkOptions { PropagateCompletion = true };
downloadBlock.LinkTo(uploadInS3Block, options);
downloadBlock.LinkTo(InsertdocIdIntoDbBlock, options);
foreach (var idDoc in ListDocId)
    downloadBlock.Post(idDoc);
downloadBlock.Complete();
//uploadInS3Block.Completion.Wait();
//InsertdocIdIntoDbBlock.Completion.Wait();
Task.WhenAll(uploadInS3Block.Completion,
             InsertdocIdIntoDbBlock.Completion).Wait();
}
A TransformBlock can be linked to more than one target, but it does not broadcast: each message is offered to the linked targets in order and is consumed by the first target that accepts it. Because uploadInS3Block greedily accepts every message, InsertdocIdIntoDbBlock never receives anything.
You need to put a BroadcastBlock between the downloadBlock and the two ActionBlocks.
downloadBlock -> broadcastBlock -> uploadInS3Block
-> InsertdocIdIntoDbBlock
In code it will look like this:
var bc = new BroadcastBlock<RequestObject>(ro => ro);
downloadBlock.LinkTo(bc);
bc.LinkTo(uploadInS3Block);
bc.LinkTo(InsertdocIdIntoDbBlock);
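Note that LinkTo without options does not propagate completion, so the Task.WhenAll(...) wait at the end of your code would never finish. Reusing the options variable from the question, the links would look like this:
downloadBlock.LinkTo(bc, options);
bc.LinkTo(uploadInS3Block, options);
bc.LinkTo(InsertdocIdIntoDbBlock, options);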
I am writing sample applications in .NET Core to interact with Kafka.
I have downloaded the official Kafka and Zookeeper Docker images to my machine.
I am using the Confluent.Kafka NuGet package for both the producer and the consumer. I am able to produce messages to Kafka, but my consumer part is not working.
Below are my producer and consumer code snippets. I am not sure what mistake I'm making here.
Do we need to explicitly create a consumer group?
Consumer code (this code is not working; the thread waits at consumer.Consume(cToken);):
var config = new ConsumerConfig
{
BootstrapServers = "localhost:9092",
GroupId = "myroupidd",
AutoOffsetReset = AutoOffsetReset.Earliest,
};
var ctokenSource = new CancellationTokenSource();
var cToken = ctokenSource.Token;
var consumerBuilder = new ConsumerBuilder<Null, string>(config);
consumerBuilder.SetPartitionsAssignedHandler((consumer, partitionlist) =>
{
consumer.Assign(new TopicPartition("myactual-toppics", 0));
Console.WriteLine("inside SetPartitionsAssignedHandler action");
});
using var consumer = consumerBuilder.Build();
consumer.Subscribe("myactual-toppics");
while (!cToken.IsCancellationRequested)
{
var consumeResult = consumer.Consume(cToken);
if (consumeResult.Message != null)
Console.WriteLine(consumeResult.Message.Value);
}
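As an aside, Consume(cToken) throws an OperationCanceledException when the token is cancelled, so this loop never exits through its condition. A common Confluent.Kafka pattern (a sketch, not the cause of the problem above) is to catch the cancellation and close the consumer so it leaves the group cleanly:
try
{
    while (!cToken.IsCancellationRequested)
    {
        var consumeResult = consumer.Consume(cToken);
        if (consumeResult.Message != null)
            Console.WriteLine(consumeResult.Message.Value);
    }
}
catch (OperationCanceledException)
{
    // expected on shutdown
}
finally
{
    consumer.Close(); // commit offsets and leave the group cleanly
}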
Producer code (this is working fine; I can see the messages using the Conduktor tool):
var config = new ProducerConfig
{
BootstrapServers = "localhost:9092",
ClientId = Dns.GetHostName(),
};
using (var producer = new ProducerBuilder<Null, string>(config).Build())
{
while (!stoppingToken.IsCancellationRequested)
{
var top = new TopicPartition("myactual-toppics", 0);
var result = await producer.ProduceAsync(top, new Message<Null, string> { Value = "My First Message" });
        Console.WriteLine("Publishedss1234");
        await Task.Delay(5000, stoppingToken);
    }
}
I am trying to add multiple schemas to the same subject in the schema registry, so I have set ValueSubjectNameStrategy to SubjectNameStrategy.TopicRecord and set automatic registration to AutomaticRegistrationBehavior.Always. But when auto-registering the schema, it still uses the SubjectNameStrategy.Topic strategy.
var schemaRegistryConfig = new SchemaRegistryConfig { Url = "http://localhost:8081", ValueSubjectNameStrategy = SubjectNameStrategy.TopicRecord };
var registry = new CachedSchemaRegistryClient(schemaRegistryConfig);
var builder = new ProducerBuilder<string, SplitLineKGN>(KafkaConfig.Producer.GetConfig(_config.GetSection("KafkaProducer")))
.SetAvroValueSerializer(registry, registerAutomatically: AutomaticRegistrationBehavior.Always)
.SetErrorHandler((_, error) => Console.Error.WriteLine(error.ToString()));
_producerMsg = builder.Build();
await _producerMsg.ProduceAsync("MyTopic", new Message<string, SplitLineKGN> { Key = key, Value = line });
How do I auto-register multiple schemas for a topic?
Ensure that you change the subject naming strategy for the topic.
SchemaRegistryConfig.ValueSubjectNameStrategy is deprecated; the strategy should now be set through the serializer's configuration (AvroSerializerConfig).
For producing multiple event types with a single producer you have to use an AvroSerializer<ISpecificRecord>, as shown below:
var schemaRegistryConfig = new SchemaRegistryConfig { Url = "http://localhost:8081" };
using var schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryConfig);
var avroSerializerConfig = new AvroSerializerConfig
{
SubjectNameStrategy = SubjectNameStrategy.TopicRecord,
AutoRegisterSchemas = true // (the default)
};
// Assuming this is your own custom code because the Confluent
// producer doesn't have anything like this.
var producerConfig = KafkaConfig.Producer.GetConfig(_config.GetSection("KafkaProducer"));
using var producer = new ProducerBuilder<string, ISpecificRecord>(producerConfig)
.SetValueSerializer(new AvroSerializer<ISpecificRecord>(schemaRegistryClient, avroSerializerConfig))
.SetErrorHandler((_, error) => Console.Error.WriteLine(error))
.Build();
var deliveryResult = await producer.ProduceAsync("MyTopic", new Message<string, ISpecificRecord>
{
Key = key,
Value = line
});
Console.WriteLine($"Delivered to: {deliveryResult.TopicPartitionOffset}");
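With SubjectNameStrategy.TopicRecord, each schema should end up registered under a subject named {topic}-{record full name} (for example, MyTopic-com.example.SplitLineKGN), which is what allows multiple record types to coexist on the same topic.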
What I am doing:
I have a microservice, part of an order processing system, which constantly consumes order messages from RabbitMQ, and I need to temporarily keep them in my microservice's DB until they are handled (accepted/declined). The order messages are forwarded to a TPL Dataflow pipeline, which has two branches.
The first one being for processing 'created' orders.
When an order is created I perform some operations, such as gathering statistics for the order, running validations, persisting it to the DB, and notifying the user through SignalR. (I have 5 blocks in this branch, and some involve network I/O calls.)
The second branch (2 blocks) is for accepted/declined orders. When I receive an order with status accepted/declined I need to remove it from my DB and also notify the user through SignalR.
The message for 'created' and 'accepted/declined' order differs only in its status property.
The forwarding to each of the two branches happens through TPL Dataflow's link predicate, based on the order status.
My problem:
Sometimes the message for a declined/accepted order arrives 50-150 ms after the message for the same order being created. Normally 50-150 ms is a long time in computing, but the first dataflow branch depends on external calls to other services, which can delay processing.
I want to make sure that I have fully processed the message with status 'created' and only after that to process the message for the same order being 'accepted/declined'.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
namespace ConsoleApp12
{
public static class Program
{
static void Main(string[] args)
{
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
var executionOptions = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 1, // potentially would be increased
BoundedCapacity = 50
};
var deserialize = new TransformBlock<OrderStatus, Order>(o =>
{
return new Order { Status = o };
}, executionOptions);
#region Created order
var b11 = new TransformBlock<Order, Order>(async o =>
{
await Task.Delay(15); // do something
return o;
}, executionOptions);
var b12 = new TransformBlock<Order, Order>(async o =>
{
await Task.Delay(15); // do something
return o;
}, executionOptions);
var b13 = new TransformBlock<Order, Order>(async o =>
{
await Task.Delay(15);
Console.WriteLine("Saved In DB");
return o;
}, executionOptions);
var b14 = new ActionBlock<Order>(async o =>
{
await Task.Delay(5);
Console.WriteLine("SignalR order created");
}, executionOptions);
#endregion
#region Accepted/Declined
var b21 = new ActionBlock<Order>(async o =>
{
await Task.Delay(5);
Console.WriteLine("Deleted from DB");
}, executionOptions);
var b22 = new ActionBlock<Order>(async o =>
{
await Task.Delay(5);
Console.WriteLine("SignalR order deleted");
}, executionOptions);
#endregion
var deleteFromDbAndSignalRInParallelJob = new List<ITargetBlock<Order>> { b21, b22 }.CreateGuaranteedBroadcastBlock();
deserialize.LinkTo(b11, linkOptions, x => x.Status == OrderStatus.Created);
b11.LinkTo(b12, linkOptions);
b12.LinkTo(b13, linkOptions);
b13.LinkTo(b14, linkOptions);
deserialize.LinkTo(deleteFromDbAndSignalRInParallelJob, linkOptions);
deserialize.Post(OrderStatus.Created);
Thread.Sleep(30); // delay between messages
deserialize.Post(OrderStatus.Declined);
Console.ReadKey();
}
}
class Order
{
public OrderStatus Status { get; init; }
}
enum OrderStatus
{
Created = 1,
Declined = 2
}
public static class DataflowExtensions
{
public static ITargetBlock<T> CreateGuaranteedBroadcastBlock<T>(this IEnumerable<ITargetBlock<T>> targets)
{
var targetsList = targets.ToList();
return new ActionBlock<T>(async item =>
{
var tasks = targetsList.Select(t => t.SendAsync(item));
await Task.WhenAll(tasks);
},
new ExecutionDataflowBlockOptions { BoundedCapacity = 100 });
}
}
}
Here is a sample with simplified models, where the logic simulates the 'created' branch taking longer to complete.
The output is:
SignalR order deleted
Deleted from DB
Saved In DB
SignalR order created
I am having a problem with Parallel.ForEach. I have written a simple application that adds file names to a download queue, then iterates through the queue in a while loop, downloading one file at a time. When a file has been downloaded, another async method is called to create an object from the downloaded MemoryStream. The returned task of that method is not awaited; it is discarded, so the next download starts immediately.

Everything works fine if I use a simple foreach in the object creation: objects are created while the download continues. But if I try to speed up object creation with Parallel.ForEach, the download stops until the objects are created. The UI is fully responsive, but it just won't download the next object.

I don't understand why this is happening: Parallel.ForEach is inside await Task.Run(), and to my limited knowledge of asynchronous programming this should do the trick. Can anyone help me understand why it blocks the first method and how to avoid it?
Here is a small sample:
public async Task DownloadFromCloud(List<string> constructNames)
{
_downloadDataQueue = new Queue<string>();
var _gcsClient = StorageClient.Create();
foreach (var item in constructNames)
{
_downloadDataQueue.Enqueue(item);
}
while (_downloadDataQueue.Count > 0)
{
var memoryStream = new MemoryStream();
await _gcsClient.DownloadObjectAsync("companyprojects",
_downloadDataQueue.Peek(), memoryStream);
memoryStream.Position = 0;
_ = ReadFileXml(memoryStream);
_downloadDataQueue.Dequeue();
}
}
private async Task ReadFileXml(MemoryStream memoryStream)
{
var reader = new XmlReader();
var properties = reader.ReadXmlTest(memoryStream);
await Task.Run(() =>
{
var entityList = new List<Entity>();
foreach (var item in properties)
{
entityList.Add(CreateObjectsFromDownloadedProperties(item));
}
//Parallel.ForEach(properties, item =>
//{
//    entityList.Add(CreateObjectsFromDownloadedProperties(item));
//});
});
}
EDIT
This is simplified object creation method:
public Entity CreateObjectsFromDownloadedProperties(RebarProperties properties)
{
var path = new LinearPath(properties.Path);
var section = new Region(properties.Region);
var sweep = section.SweepAsMesh(path, 1);
return sweep;
}
Returned result of this method is not awaited, it is discarded, so the next download starts immediately.
This is also dangerous. "Fire and forget" means "I don't care when this operation completes, or if it completes. Just discard all exceptions because I don't care." So fire-and-forget should be extremely rare in practice. It's not appropriate here.
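If losing the download/processing overlap is acceptable, the minimal fix is to await the call instead of discarding the task, so exceptions propagate to the caller (a sketch against the question's loop):
memoryStream.Position = 0;
await ReadFileXml(memoryStream); // exceptions now surface here
_downloadDataQueue.Dequeue();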
UI is fully responsive, but it just won't download the next object.
I have no idea why it would block the downloads, but there's a definite problem in switching to Parallel.ForEach: List<T>.Add is not thread-safe.
private async Task ReadFileXml(MemoryStream memoryStream)
{
var reader = new XmlReader();
var properties = reader.ReadXmlTest(memoryStream);
await Task.Run(() =>
{
var entityList = new List<Entity>();
Parallel.ForEach(properties, item =>
{
var itemToAdd = CreateObjectsFromDownloadedProperties(item);
lock (entityList) { entityList.Add(itemToAdd); }
});
});
}
One tip: if you have a result value, PLINQ is often cleaner than Parallel:
private async Task ReadFileXml(MemoryStream memoryStream)
{
var reader = new XmlReader();
var properties = reader.ReadXmlTest(memoryStream);
await Task.Run(() =>
{
var entityList = properties
.AsParallel()
.Select(CreateObjectsFromDownloadedProperties)
.ToList();
});
}
However, the code still suffers from the fire-and-forget problem.
For a better fix, I'd recommend taking a step back and using something more suited to "pipeline"-style processing. E.g., TPL Dataflow:
public async Task DownloadFromCloud(List<string> constructNames)
{
// Set up the pipeline.
var gcsClient = StorageClient.Create();
var downloadBlock = new TransformBlock<string, MemoryStream>(async constructName =>
{
var memoryStream = new MemoryStream();
await gcsClient.DownloadObjectAsync("companyprojects", constructName, memoryStream);
memoryStream.Position = 0;
return memoryStream;
});
var processBlock = new TransformBlock<MemoryStream, List<Entity>>(memoryStream =>
{
var reader = new XmlReader();
var properties = reader.ReadXmlTest(memoryStream);
return properties
.AsParallel()
.Select(CreateObjectsFromDownloadedProperties)
.ToList();
});
var resultsBlock = new ActionBlock<List<Entity>>(entities => { /* TODO */ });
downloadBlock.LinkTo(processBlock, new DataflowLinkOptions { PropagateCompletion = true });
processBlock.LinkTo(resultsBlock, new DataflowLinkOptions { PropagateCompletion = true });
// Push data into the pipeline.
foreach (var constructName in constructNames)
await downloadBlock.SendAsync(constructName);
downloadBlock.Complete();
// Wait for pipeline to complete.
await resultsBlock.Completion;
}
My media bag is getting populated inside the Parallel.ForEach, but when execution reaches the last line the mediaBag is empty. Why?
var mediaBag = new ConcurrentBag<MediaDto>();
Parallel.ForEach(mediaList,
new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
async media =>
{
var imgBytes = await this.blobStorageService.ReadMedia(media.BlobID, Enums.MediaType.Image);
var fileContent = Convert.ToBase64String(imgBytes);
var image = new MediaDto()
{
ImageId = media.MediaID,
Title = media.Title,
Description = media.Description,
ImageContent = fileContent
};
mediaBag.Add(image);
});
return mediaBag.ToList();
Is this because my blob storage function is not thread-safe? What would that mean, and what is the solution if that is the case?
Parallel.ForEach doesn't work with async actions: an async lambda passed to it compiles to an async void method, so ForEach returns as soon as each body hits its first await, long before the downloads have finished and the bag has been populated.
You could start the ReadMedia downloads as tasks, wait for them all to complete using Task.WhenAll, and then create the MediaDto objects in parallel. Something like this:
var mediaBag = new ConcurrentBag<MediaDto>();
// Pair each media item with its download task so both are available afterwards.
var tasks = mediaList
    .Select(async media => (media, imgBytes: await blobStorageService.ReadMedia(media.BlobID, Enums.MediaType.Image)))
    .ToArray();
var results = await Task.WhenAll(tasks);
Parallel.ForEach(results,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    result =>
    {
        var fileContent = Convert.ToBase64String(result.imgBytes);
        var image = new MediaDto()
        {
            ImageId = result.media.MediaID,
            Title = result.media.Title,
            Description = result.media.Description,
            ImageContent = fileContent
        };
        mediaBag.Add(image);
    });
return mediaBag.ToList();
Parallelism isn't concurrency. Parallel.ForEach is meant for data parallelism, not executing concurrent actions. It partitions the input data and uses as many worker tasks as there are cores to process one partition each. It doesn't work at all with asynchronous methods because that would defeat its very purpose.
What you ask for is concurrent operations - eg downloading 100 files, 4 or 6 at a time. One way would be to just launch all 100 tasks and wait for them to finish. That's a bit extreme and will probably flood the network connection.
A better way to do this would be to use a TPL Dataflow block like TransformBlock with a specific DOP, eg:
var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 };
var buffer = new BufferBlock<MediaDto>();
var block = new TransformBlock<ThatMedia, MediaDto>(async media =>
{
var imgBytes = await this.blobStorageService.ReadMedia(media.BlobID, Enums.MediaType.Image);
var fileContent = Convert.ToBase64String(imgBytes);
var image = new MediaDto()
{
ImageId = media.MediaID,
Title = media.Title,
Description = media.Description,
ImageContent = fileContent
};
return image;
}, options);
block.LinkTo(buffer);
After that, you can start posting entries to the block.
foreach(var entry in mediaList)
{
block.Post(entry);
}
block.Complete();
await block.Completion;
if (buffer.TryReceiveAll(out var theNewList))
{
...
}
Thanks for the advice, I believe I may have misunderstood the Parallel.ForEach use case.
I have modified the function to use a list of tasks instead, and it works very nicely. Below are the changes I made.
var mediaBag = new ConcurrentBag<MediaDto>();
IEnumerable<Task> mediaTasks = mediaList.Select(async m =>
{
var imgBytes = await this.blobStorageService.ReadMedia(m.BlobID, Enums.MediaType.Image);
var fileContent = Convert.ToBase64String(imgBytes);
var image = new MediaDto()
{
ImageId = m.MediaID,
Title = m.Title,
Description = m.Description,
ImageContent = fileContent
};
mediaBag.Add(image);
});
await Task.WhenAll(mediaTasks);
return mediaBag.ToList();
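One caveat, echoing the previous answer: Select starts all the downloads at once. If throttling is needed and .NET 6+ is available, Parallel.ForEachAsync gives the same shape with a concurrency cap (a sketch, assuming the same blobStorageService and DTOs):
var mediaBag = new ConcurrentBag<MediaDto>();
await Parallel.ForEachAsync(mediaList,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },
    async (m, ct) =>
    {
        var imgBytes = await this.blobStorageService.ReadMedia(m.BlobID, Enums.MediaType.Image);
        mediaBag.Add(new MediaDto()
        {
            ImageId = m.MediaID,
            Title = m.Title,
            Description = m.Description,
            ImageContent = Convert.ToBase64String(imgBytes)
        });
    });
return mediaBag.ToList();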