A datablock to join a single result with multiple other results - c#

In my application I want to join multiple strings with a dictionary of replacement values.
The readTemplateBlock gets fed with FileInfos and returns their contents as string.
The getReplacersBlock gets fed (once) with a single replacers dictionary.
The joinTemplateAndReplacersBlock should join each item of the readTemplateBlock with the one getReplacersBlock result.
In my current setup it requires me to post the same replacers dictionary again for each file I post.
// Build
var readTemplateBlock = new TransformBlock<FileInfo, string>(file => File.ReadAllText(file.FullName));
var getReplacersBlock = new WriteOnceBlock<IDictionary<string, string>>(null);
var joinTemplateAndReplacersBlock = new JoinBlock<string, IDictionary<string, string>>();
// Assemble
var propagateComplete = new DataflowLinkOptions {PropagateCompletion = true};
readTemplateBlock.LinkTo(joinTemplateAndReplacersBlock.Target1, propagateComplete);
getReplacersBlock.LinkTo(joinTemplateAndReplacersBlock.Target2, propagateComplete);
joinTemplateAndReplacersBlock.LinkTo(replaceTemplateBlock, propagateComplete);
// Post
foreach (var template in templateFilenames)
{
readTemplateBlock.Post(template);
}
readTemplateBlock.Complete();
getReplacersBlock.Post(replacers);
getReplacersBlock.Complete();
Is there a better block I'm missing? Maybe a configuration option I overlooked?

I couldn't figure out how to do this using the built-in dataflow blocks. The alternatives I can see:
Use a BufferBlock with a small BoundedCapacity along with a Task that keeps sending the value to it. How exactly the Task gets the value could vary, but if you like WriteOnceBlock, you could reuse and encapsulate it:
static IPropagatorBlock<T, T> CreateWriteOnceRepeaterBlock<T>()
{
var target = new WriteOnceBlock<T>(null);
var source = new BufferBlock<T>(new DataflowBlockOptions { BoundedCapacity = 1 });
Task.Run(
async () =>
{
var value = await target.ReceiveAsync();
while (true)
{
await source.SendAsync(value);
}
});
return DataflowBlock.Encapsulate(target, source);
}
You would then use CreateWriteOnceRepeaterBlock<IDictionary<string, string>>() instead of new WriteOnceBlock<IDictionary<string, string>>(null).
Write a custom block similar to WriteOnceBlock that behaves exactly the way you want. Looking at how big the source of WriteOnceBlock is, this is probably not very appealing.
Use a TaskCompletionSource instead of dataflow blocks for this.
Assuming your current code looks something like this (using C# 7 and the System.ValueTuple package for brevity):
void ReplaceTemplateBlockAction(Tuple<string, IDictionary<string, string>> tuple)
{
var (template, replacers) = tuple;
…
}
…
var getReplacersBlock = new WriteOnceBlock<IDictionary<string, string>>(null);
var replaceTemplateBlock = new ActionBlock<Tuple<string, IDictionary<string, string>>>(
ReplaceTemplateBlockAction);
…
getReplacersBlock.Post(replacers);
You would instead use:
void ReplaceTemplateBlockAction(string template, IDictionary<string, string> replacers)
{
…
}
…
var getReplacersTcs = new TaskCompletionSource<IDictionary<string, string>>();
var replaceTemplateBlock = new ActionBlock<string>(
async template => ReplaceTemplateBlockAction(template, await getReplacersTcs.Task));
…
getReplacersTcs.SetResult(replacers);
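The idea behind the TaskCompletionSource approach can be tried without any dataflow blocks at all. The following is only a minimal sketch: the names SharedReplacersSketch, Apply, RenderAllAsync and DemoAsync are made up for illustration, and plain string replacement stands in for whatever the real replaceTemplateBlock does. It shows several templates awaiting the same task, with the dictionary published once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SharedReplacersSketch
{
    // Hypothetical helper: applies every replacement pair to the template text.
    public static string Apply(string template, IDictionary<string, string> replacers)
    {
        foreach (var pair in replacers)
        {
            template = template.Replace(pair.Key, pair.Value);
        }
        return template;
    }

    public static async Task<string[]> RenderAllAsync(
        IEnumerable<string> templates,
        Task<IDictionary<string, string>> replacersTask)
    {
        // Every template awaits the same task; the dictionary is supplied only once.
        var renders = templates.Select(async template =>
            Apply(template, await replacersTask));
        return await Task.WhenAll(renders);
    }

    public static async Task<string[]> DemoAsync()
    {
        var tcs = new TaskCompletionSource<IDictionary<string, string>>();
        var pending = RenderAllAsync(new[] { "Hello {name}", "Bye {name}" }, tcs.Task);

        // No template can finish until the single result is published here.
        tcs.SetResult(new Dictionary<string, string> { ["{name}"] = "World" });
        return await pending;
    }
}
```

Because a completed Task can be awaited any number of times, there is no need to re-post the dictionary for each file.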

Related

Using Parallel.ForEach Create multiple requests in parallel and put them in the list

So I had to create dozens of API requests and get json to make it an object and put it in a list.
I also wanted the requests to be parallel because I do not care about the order in which the objects enter the list.
public ConcurrentBag<myModel> GetlistOfDstAsync()
{
var req = new RequestGenerator();
var InitializedObjects = req.GetInitializedObjects();
var myList = new ConcurrentBag<myModel>();
Parallel.ForEach(InitializedObjects, async item =>
{
RestRequest request = new RestRequest("resource",Method.GET);
request.AddQueryParameter("key", item.valueOne);
request.AddQueryParameter("key", item.valueTwo);
var results = await GetAsync<myModel>(request);
myList.Add(results);
});
return myList;
}
This creates a new problem: I do not understand how to put the results in the list, and it seems I am not using ConcurrentBag the way it is meant to be used.
Is my assumption correct and I implement it wrong or should I use another solution?
I also wanted the requests to be parallel
What you actually want is concurrent requests. Parallel.ForEach does not work as expected with async lambdas: it treats them as async void and returns before the awaited work has finished.
To do asynchronous concurrency, you start each request but do not await the tasks yet. Then, you can await all the tasks together and get the responses using Task.WhenAll:
public async Task<myModel[]> GetlistOfDstAsync()
{
var req = new RequestGenerator();
var InitializedObjects = req.GetInitializedObjects();
var tasks = InitializedObjects.Select(async item =>
{
RestRequest request = new RestRequest("resource",Method.GET);
request.AddQueryParameter("key", item.valueOne);
request.AddQueryParameter("key", item.valueTwo);
return await GetAsync<myModel>(request);
}).ToList();
var results = await Task.WhenAll(tasks);
return results;
}
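To see the pattern in isolation, here is a small sketch with no HTTP involved. WhenAllSketch and FakeRequestAsync are invented names, and Task.Delay stands in for the real GetAsync call, so treat it as an illustration of the start-then-await-all idea rather than the asker's actual code.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllSketch
{
    // Stand-in for the real HTTP call: waits briefly, then returns a value.
    private static async Task<int> FakeRequestAsync(int id)
    {
        await Task.Delay(50);
        return id * 10;
    }

    public static async Task<int[]> GetAllAsync(int count)
    {
        // ToList() starts every request immediately; nothing is awaited yet.
        var tasks = Enumerable.Range(1, count)
            .Select(id => FakeRequestAsync(id))
            .ToList();

        // Results come back in the order of the tasks, not in completion order.
        return await Task.WhenAll(tasks);
    }
}
```

Since Task.WhenAll already returns the results as an array, no ConcurrentBag or other shared collection is needed.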

Parallel execution of tasks in groups

I will describe my problem with a simple example first and then with one closer to my real case.
Imagine we have n items [i1,i2,i3,i4,...,in] in box1, and a box2 that can work on m items at a time (m is usually much less than n). The time required for each item is different. I want m items to be in progress at all times until all items have been processed.
A closer example: you have a list1 of n strings (URL addresses) of files, and we want a system that keeps m files downloading concurrently (for example via the httpclient.getAsync() method). Whenever one of the m downloads finishes, another remaining item from list1 must take its place as soon as possible, and this must continue until all of the list1 items have been processed.
(number of n and m are specified by users input at runtime)
How this can be done?
Here is a generic method you can use.
When you call this, TIn will be string (the URL addresses) and asyncProcessor will be your async method that takes the URL address as input and returns a Task.
The SemaphoreSlim used by this method allows at most maxDegreeOfParallelism concurrent async I/O requests at any moment; as soon as one completes, the next request starts. It works like a sliding window.
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism;
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
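A quick way to convince yourself that the throttle holds is to count how many jobs are in flight at once. The sketch below is an assumption-laden illustration: ThrottleSketch and MeasurePeakAsync are invented names, Task.Delay stands in for the download, and the method is repeated in compact form (with the limit as a required parameter) only so the sample compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottleSketch
{
    // Compact copy of the throttling method above, so the sketch is self-contained.
    public static Task ForEachAsync<TIn>(
        IEnumerable<TIn> inputs, Func<TIn, Task> asyncProcessor, int maxDegreeOfParallelism)
    {
        var throttler = new SemaphoreSlim(maxDegreeOfParallelism, maxDegreeOfParallelism);
        var tasks = inputs.Select(async input =>
        {
            await throttler.WaitAsync().ConfigureAwait(false);
            try { await asyncProcessor(input).ConfigureAwait(false); }
            finally { throttler.Release(); }
        });
        return Task.WhenAll(tasks);
    }

    // Runs n fake jobs with at most m in flight and returns the highest concurrency observed.
    public static async Task<int> MeasurePeakAsync(int n, int m)
    {
        var gate = new object();
        int running = 0, peak = 0;

        await ForEachAsync(Enumerable.Range(0, n), async item =>
        {
            lock (gate) { running++; peak = Math.Max(peak, running); }
            await Task.Delay(20); // stand-in for the real download
            lock (gate) { running--; }
        }, m);

        return peak;
    }
}
```

Calling MeasurePeakAsync(10, 3) should never report a peak above 3, because a job only increments the counter after it has acquired the semaphore.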
You should look into TPL Dataflow. Add the System.Threading.Tasks.Dataflow NuGet package to your project; then what you want is as simple as:
private static HttpClient _client = new HttpClient();
public async Task<List<MyClass>> ProcessDownloads(IEnumerable<string> uris,
int concurrentDownloads)
{
var result = new List<MyClass>();
var downloadData = new TransformBlock<string, string>(async uri =>
{
return await _client.GetStringAsync(uri); //GetStringAsync is a thread safe method.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
var processData = new TransformBlock<string, MyClass>(
json => JsonConvert.DeserializeObject<MyClass>(json),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded});
var collectData = new ActionBlock<MyClass>(
data => result.Add(data)); //When you don't specify options, dataflow processes items one at a time.
//Set up the chain of blocks; have each block call `.Complete()` on the next block when it finishes processing its last item.
downloadData.LinkTo(processData, new DataflowLinkOptions {PropagateCompletion = true});
processData.LinkTo(collectData, new DataflowLinkOptions {PropagateCompletion = true});
//Load the data in to the first transform block to start off the process.
foreach (var uri in uris)
{
await downloadData.SendAsync(uri).ConfigureAwait(false);
}
downloadData.Complete(); //Signal you are done adding data.
//Wait for the last object to be added to the list.
await collectData.Completion.ConfigureAwait(false);
return result;
}
In the above code, at most concurrentDownloads requests will be in flight at any given time (all through the single shared HttpClient), an unlimited number of threads will process the received strings and turn them into objects, and a single thread will take those objects and add them to a list.
UPDATE: here is a simplified example that only does what you asked for in the question
private static HttpClient _client = new HttpClient();
public void ProcessDownloads(IEnumerable<string> uris, int concurrentDownloads)
{
var downloadData = new ActionBlock<string>(async uri =>
{
var response = await _client.GetAsync(uri); //GetAsync is a thread safe method.
//do something with response here.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
foreach (var uri in uris)
{
downloadData.Post(uri);
}
downloadData.Complete();
downloadData.Completion.Wait();
}
A simple solution for throttling is a SemaphoreSlim.
EDIT
After a slight alteration the code now creates the tasks when they are needed
var client = new HttpClient();
SemaphoreSlim semaphore = new SemaphoreSlim(m, m); //set the max here
var tasks = new List<Task>();
foreach(var url in urls)
{
// moving the wait here throttles the foreach loop
await semaphore.WaitAsync();
tasks.Add(((Func<Task>)(async () =>
{
//await semaphore.WaitAsync();
try
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}
finally
{
// release in finally so a failed request cannot leak a slot
semaphore.Release();
}
}))());
}
await Task.WhenAll(tasks);
This is another way to do it
var client = new HttpClient();
var tasks = new HashSet<Task>();
foreach(var url in urls)
{
if(tasks.Count == m)
{
tasks.Remove(await Task.WhenAny(tasks));
}
tasks.Add(((Func<Task>)(async () =>
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}))());
}
await Task.WhenAll(tasks);
Process items in parallel, limiting the number of simultaneous jobs:
string[] strings = GetStrings(); // Items to process.
const int m = 2; // Max simultaneous jobs.
Parallel.ForEach(strings, new ParallelOptions {MaxDegreeOfParallelism = m}, s =>
{
DoWork(s);
});

How can I convert set of if conditions with async method calls to Dictionary?

I've got some method calls looking like this
if (SectionContainedWithin(args, RegisterSection.StudentPersonalData))
schoolRegister.StudentPersonalData = await _sectionGeneratorsProvider.StudentPersonalDataGenerator.GenerateAsync(args);
if (SectionContainedWithin(args, RegisterSection.StudentAttendances))
schoolRegister.StudentAttendances = await _sectionGeneratorsProvider.StudentMonthAttendancesGenerator.GenerateAsync(args);
if (SectionContainedWithin(args, RegisterSection.Grades))
schoolRegister.Grades = await _sectionGeneratorsProvider.GradesGenerator.GenerateAsync(args);
// More generating here ...
Every GenerateAsync produces object of different type.
public interface IGenerator<TResult, in TArgs>
{
Task<TResult> GenerateAsync(TArgs args);
}
How can I rewrite those ifs so I could define a list of actions and conditions and then iterate through them.
Something like:
var sections = new Dictionary<bool, Func<Task>>()
{
{
SectionContainedWithin(args, RegisterSection.StudentPersonalData),
async () => schoolRegister.StudentPersonalData = await _sectionGeneratorsProvider.StudentPersonalDataGenerator.GenerateAsync(args)
},
// More generating here ...
};
foreach(var item in sections)
{
if(item.Key)
{
await item.Value();
}
}
SOLUTION:
Thanks to @peter-duniho's answer I dropped the idea of creating a Dictionary<bool, Func<Task>> and replaced it with IReadOnlyDictionary<RegisterSection, Func<RegisterXml, RegisterGenerationArgs, Task>>, since it makes more sense.
It allows multiple keys, not only the two values true/false.
So that's what I ended up with:
private IReadOnlyDictionary<RegisterSection, Func<RegisterXml, RegisterGenerationArgs, Task>> CreateSectionActionsDictionary()
{
return new Dictionary<RegisterSection, Func<RegisterXml, RegisterGenerationArgs, Task>>
{
{ RegisterSection.RegisterCover, async(reg, args) => reg.Cover = await _sectionGenerators.RegisterCoverGenerator.GenerateAsync(args) },
{ RegisterSection.StudentPersonalData, async(reg, args) => reg.StudentPersonalData = await _sectionGenerators.StudentPersonalDataGenerator.GenerateAsync(args)},
// Add more generating here ...
};
}
private async Task GenerateSectionsAsync(RegisterGenerationArgs args, RegisterXml reg)
{
foreach (var sectionAction in SectionActions)
if (SectionContainedWithin(args, sectionAction.Key))
await sectionAction.Value(reg, args);
}
If I understand the code example correctly, a dictionary really isn't the right tool for this job. You appear to be using it only to store pairs of values; you're not using the primary feature of a dictionary, which is to be able to map a key value you already know to another value.
In addition, the code example you are proposing won't work because you need the args value in order to evaluate the SectionContainedWithin() call. Even if you were trying to declare the dictionary in a context where args is valid and can be used to initialize the dictionary, you would have the problem that it makes the key type bool, which would mean you could only have two entries in the dictionary at most, and would have no way to actually handle all of the combinations where the SectionContainedWithin() method returned true.
Without a good Minimal, Complete, and Verifiable code example that shows clearly what you're doing, it's impossible to know for sure what exactly you need. But it will look something like this:
struct SectionGenerator<TArgs>
{
public readonly RegisterSection RegisterSection;
public readonly Func<TArgs, Task> Generate;
public SectionGenerator(RegisterSection registerSection, Func<TArgs, Task> generate)
{
RegisterSection = registerSection;
Generate = generate;
}
}
SectionGenerator<TArgs>[] generators =
{
new SectionGenerator<TArgs>(RegisterSection.StudentPersonalData,
async args => schoolRegister.StudentPersonalData = await _sectionGeneratorsProvider.StudentPersonalDataGenerator.GenerateAsync(args)),
// etc.
};
Then you can do something like:
foreach (SectionGenerator<TArgs> generator in generators)
{
if (SectionContainedWithin(args, generator.RegisterSection))
{
await generator.Generate(args);
}
}
Assuming it's reasonable for all these async operations to be in progress concurrently, you could even do something like this:
await Task.WhenAll(generators
.Where(g => SectionContainedWithin(args, g.RegisterSection))
.Select(g => g.Generate(args)));
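The filter-then-WhenAll idea can be checked with a self-contained sketch. Everything here is a hypothetical stand-in for the question's types: SectionSketch, Section, Register and FakeAsync are invented, a value tuple replaces the SectionGenerator struct for brevity, and Task.Delay stands in for the real generators.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SectionSketch
{
    // Hypothetical stand-ins for RegisterSection and the register object.
    public enum Section { Cover, PersonalData, Grades }

    public sealed class Register
    {
        public string Cover;
        public string PersonalData;
        public string Grades;
    }

    public static async Task<Register> GenerateAsync(ISet<Section> requested)
    {
        var register = new Register();

        // Each entry pairs a section with the async action that fills it in.
        var generators = new (Section Kind, Func<Task> Generate)[]
        {
            (Section.Cover, async () => register.Cover = await FakeAsync("cover")),
            (Section.PersonalData, async () => register.PersonalData = await FakeAsync("personal")),
            (Section.Grades, async () => register.Grades = await FakeAsync("grades")),
        };

        // Start only the requested generators, then wait for all of them together.
        await Task.WhenAll(generators
            .Where(g => requested.Contains(g.Kind))
            .Select(g => g.Generate()));
        return register;
    }

    // Stand-in for a real generator's GenerateAsync call.
    private static async Task<string> FakeAsync(string value)
    {
        await Task.Delay(10);
        return value;
    }
}
```

Sections that were not requested are simply left unset, which mirrors the behavior of the original chain of if statements.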

Limiting concurrent requests using Rx and SelectMany

I have a list of URLs of pages I want to download concurrently using HttpClient. The list of URLs can be large (100 or more!)
I currently have this code:
var urls = new List<string>
{
@"http:\\www.amazon.com",
@"http:\\www.bing.com",
@"http:\\www.facebook.com",
@"http:\\www.twitter.com",
@"http:\\www.google.com"
};
var client = new HttpClient();
var contents = urls
.ToObservable()
.SelectMany(uri => client.GetStringAsync(new Uri(uri, UriKind.Absolute)));
contents.Subscribe(Console.WriteLine);
The problem: due to the use of SelectMany, a large number of Tasks are created at almost the same time. It seems that if the list of URLs is big enough, a lot of Tasks time out (I'm getting "A Task was cancelled" exceptions).
So, I thought there should be a way, maybe using some kind of Scheduler, to limit the number of concurrent Tasks, not allowing more than 5 or 6 at a given time.
This way I could get concurrent downloads without launching too many tasks that may stall, as they do right now.
How can I do that so I don't end up with lots of timed-out Tasks?
Remember that SelectMany() is effectively Select().Merge(). While SelectMany does not have a maxConcurrent parameter, Merge() does, so you can use that.
From your example, you can do this:
var urls = new List<string>
{
@"http:\\www.amazon.com",
@"http:\\www.bing.com",
@"http:\\www.facebook.com",
@"http:\\www.twitter.com",
@"http:\\www.google.com"
};
var client = new HttpClient();
var contents = urls
.ToObservable()
.Select(uri => Observable.FromAsync(() => client.GetStringAsync(uri)))
.Merge(2); // 2 maximum concurrent requests!
contents.Subscribe(Console.WriteLine);
Here is an example of how you can do it with the TPL Dataflow API:
private static Task DoIt()
{
var urls = new List<string>
{
@"http:\\www.amazon.com",
@"http:\\www.bing.com",
@"http:\\www.facebook.com",
@"http:\\www.twitter.com",
@"http:\\www.google.com"
};
var client = new HttpClient();
//Create a block that takes a URL as input
//and produces the download result as output
TransformBlock<string,string> downloadBlock =
new TransformBlock<string, string>(
uri => client.GetStringAsync(new Uri(uri, UriKind.Absolute)),
new ExecutionDataflowBlockOptions
{
//At most 2 download operations execute at the same time
MaxDegreeOfParallelism = 2
});
//Create a block that prints out the result
ActionBlock<string> doneBlock =
new ActionBlock<string>(x => Console.WriteLine(x));
//Link the output of the first block to the input of the second one
downloadBlock.LinkTo(
doneBlock,
new DataflowLinkOptions { PropagateCompletion = true});
//input the urls into the first block
foreach (var url in urls)
{
downloadBlock.Post(url);
}
downloadBlock.Complete(); //Mark completion of input
//Allows consumer to wait for the whole operation to complete
return doneBlock.Completion;
}
static void Main(string[] args)
{
DoIt().Wait();
Console.WriteLine("Done");
Console.ReadLine();
}
Can you see if this helps?
var urls = new List<string>
{
@"http:\\www.amazon.com",
@"http:\\www.bing.com",
@"http:\\www.google.com",
@"http:\\www.twitter.com",
@"http:\\www.google.com"
};
var contents =
urls
.ToObservable()
.SelectMany(uri =>
Observable
.Using(
() => new System.Net.Http.HttpClient(),
client =>
client
.GetStringAsync(new Uri(uri, UriKind.Absolute))
.ToObservable()));

Parallel.ForEach with HttpClient and ContinueWith

I have a method that attempts to download data from several URLs in Parallel, and return an IEnumerable of Deserialized types
The method looks like this:
public IEnumerable<TContent> DownloadContentFromUrls(IEnumerable<string> urls)
{
var list = new List<TContent>();
Parallel.ForEach(urls, url =>
{
lock (list)
{
_httpClient.GetAsync(url).ContinueWith(request =>
{
var response = request.Result;
//todo ensure success?
response.Content.ReadAsStringAsync().ContinueWith(text =>
{
var results = JObject.Parse(text.Result)
.ToObject<IEnumerable<TContent>>();
list.AddRange(results);
});
});
}
});
return list;
}
In my unit test (I stub out _httpClient to return a known set of text) I basically get
Sequence contains no elements
This is because the method is returning before the tasks have completed.
If I add .Wait() on the end of my .ContinueWith() calls, it passes, but I'm sure that I'm misusing the API here...
If you want a blocking call which downloads in parallel using the HttpClient.GetAsync method then you should implement it like so:
public IEnumerable<TContent> DownloadContentFromUrls<TContent>(IEnumerable<string> urls)
{
var queue = new ConcurrentQueue<TContent>();
using (var client = new HttpClient())
{
Task.WaitAll(urls.Select(url =>
{
return client.GetAsync(url).ContinueWith(response =>
{
var content = JsonConvert.DeserializeObject<IEnumerable<TContent>>(response.Result.Content.ReadAsStringAsync().Result);
foreach (var c in content)
queue.Enqueue(c);
});
}).ToArray());
}
return queue;
}
This creates an array of tasks, one for each Url, which represents a GetAsync/Deserialize operation. This is assuming that the Url returns a Json array of TContent. An empty array or a single member array will deserialize fine, but not a single array-less object.
