Converting a IEnumerable<T> to IObservable<T>, with maximum parallelism

Converting a IEnumerable<T> to IObservable<T>, with maximum parallelism - c#

I have a sequence of async tasks to do (say, fetch N web pages). Now what I want is to expose them all as an IObservable<T>. My current solution uses the answer from this question:
async Task<ResultObj> GetPage(string page) {
Console.WriteLine("Before");
var result = await FetchFromInternet(page);
Console.WriteLine("After");
return result;
}
// pages is an IEnumerable<string>
IObservable<ResultObj> resultObservable =pages.Select(GetPage).
Select(t => Observable.FromAsync(() => t)).Merge();
// Now consume the list
foreach(ResultObj obj in resultObservable.ToEnumerable()) {
Console.WriteLine(obj.ToString());
}
The problem is that I do not know the number of pages to be fetched, and it could be large. I do not want to make hundreds of simultaneous requests. So I want a way to limit the maximum number of tasks that will be executed in parallel. Is there a way to limit the number of concurrent invocations of GetPage?
There is a Merge overload that takes a maxConcurrent parameter, but it does not seem to actually limit the concurrency of the function invokation. THe console prints all the Before messages before the After messages.
Note: I need to convert back to IEnumerable<T>. I'm writing a data source for a system that gives me descriptors of data to fetch, and I need to give it back a list of the downloaded data.

EDIT
The following should work. This overload limits the number of concurrent subscriptions.
var resultObservable = pages
.Select(p => Observable.FromAsync(() => GetPage(p)))
.Merge(maxConcurrent);
Explanation
In order to understand why this change is needed we need some background
FromAsync returns an observable that will invoke the passed Func every time it is subscribed to. This implies that if the observable is never subscribed to, it will never be invoked.
Merge eagerly reads the source sequence, and only subscribes to n observables simultaneously.
With these two pieces we can know why the original version will execute everything in parallel: because of (2), GetPage will have already been invoked for all the source strings by the time Merge decides how many observables need to be subscribed.
And we can also see why the second version works: even though the sequence has been fully iterated over, (1) means that GetPage is not invoked until Merge decides it needs to subscribe to n observables. This leads to the desired result of only n tasks being executed simultaneously.

Related

System.Reactive Throttling an async method

I have been putting off using reactive extensions for so long, and I thought this would be a good use. Quite simply, I have a method that can be called for various reasons on various code paths
private async Task GetProductAsync(string blah) {...}
I need to be able to throttle this method. That's to say, I want to stop the flow of calls until no more calls are made (for a specified period of time). Or more clearly, if 10 calls to this method happen within a certain time period, i want to limit (throttle) it to only 1 call (after a period) when the last call was made.
I can see an example using a method with IEnumerable, this kind of makes sense
static IEnumerable<int> GenerateAlternatingFastAndSlowEvents()
{ ... }
...
var observable = GenerateAlternatingFastAndSlowEvents().ToObservable().Timestamp();
var throttled = observable.Throttle(TimeSpan.FromMilliseconds(750));
using (throttled.Subscribe(x => Console.WriteLine("{0}: {1}", x.Value, x.Timestamp)))
{
Console.WriteLine("Press any key to unsubscribe");
Console.ReadKey();
}
Console.WriteLine("Press any key to exit");
Console.ReadKey();
However, (and this has always been my major issue with Rx, forever), how do I create an Observable from a simple async method.
Update
I have managed to find an alternative approach using ReactiveProperty
Barcode = new ReactiveProperty<string>();
Barcode.Select(text => Observable.FromAsync(async () => await GetProductAsync(text)))
.Throttle(TimeSpan.FromMilliseconds(1000))
.Switch()
.ToReactiveProperty();
The premise is I catch it at the text property Barcode, however it has its own drawbacks, as ReactiveProperty takes care of notification, and I cant silently update the backing field as its already managed.
To summarise, how can I convert an async method call to Observable, so I can user the Throttle method?

Unrelated to your question, but probably helpful: Rx's Throttle operator is really a debounce operator. The closest thing to a throttling operator is Sample. Here's the difference (assuming you want to throttle or debounce to one item / 3 seconds):
items : --1-23----4-56-7----8----9-
throttle: --1--3-----4--6--7--8-----9
debounce: --1-------4--6------8----9-
Sample/throttle will bunch items that arrive in the sensitive time and emit the last one on the next sampling tick. Debounce throws away items that arrive in the sensitive time, then re-starts the clock: The only way for an item to emit is if it was preceded by Time-Range of silence.
RX.Net's Throttle operator does what debounce above depicts. Sample does what throttle above depicts.
If you want something different, describe how you want to throttle.

There are two key ways of converting a Task to an Observable, with an important difference between them.
Observable.FromAsync(()=>GetProductAsync("test"));
and
GetProductAsync("test").ToObservable();
The first will not start the Task until you subscribe to it.
The second will create (and start) the task and the result will either immediately or sometime later appear in the observable, depending on how fast the Task is.
Looking at your question in general though, it seems that you want to stop the flow of calls. You do not want to throttle the flow of results, which would result in unnecessary computation and loss.
If this is your aim, your GetProductAsync could be seen as an observer of call events, and the GetProductAsync should throttle those calls. One way of achieving that would be to declare a
public event Action<string> GetProduct;
and use
var callStream= Observable.FromEvent<string>(
handler => GetProduct+= handler ,
handler => GetProduct-= handler);
The problem then becomes how to return the result and what should happen when your 'caller's' call is throttled out and discarded.
One approach there could be to declare a type "GetProductCall" which would have the input string and output result as properties.
You could then have a setup like:
var callStream= Observable.FromEvent<GetProductCall>(
handler => GetProduct+= handler ,
handler => GetProduct-= handler)
.Throttle(...)
.Select(r=>async r.Result= await GetProductCall(r.Input).ToObservable().FirstAsync());
(code not tested, just illustrative)
Another approach might include the Merge(N) overload that limits the max number of concurrent observables.

Parallel or async ASP.NET Core C#

I've googled this plenty but I'm afraid I don't fully understand the consequences of concurrency and parallelism.
I have about 3000 rows of database objects that each have an average of 2-4 logical data attached to them that need to be validated as a part of a search query, meaning the validation service needs to execute approx. 3*3000 times. E.g. the user has filtered on color then each row needs to validate the color and return the result. The loop cannot break when a match has been found, meaning all logical objects will always need to be evaluated (this is due to calculations of relevance and just not a match).
This is done on-demand when the user selects various properties, meaning performance is key here.
I'm currently doing this by using Parallel.ForEach but wonder if it is smarter to use async behavior instead?
Current way
var validatorService = new LogicalGroupValidatorService();
ConcurrentBag<StandardSearchResult> results = new ConcurrentBag<StandardSearchResult>();
Parallel.ForEach(searchGroups, (group) =>
{
var searchGroupResult = validatorService.ValidateLogicGroupRecursivly(
propertySearchQuery, group.StandardPropertyLogicalGroup);
result.Add(new StandardSearchResult(searchGroupResult));
});
Async example code
var validatorService = new LogicalGroupValidatorService();
List<StandardSearchResult> results = new List<StandardSearchResult>();
var tasks = new List<Task<StandardPropertyLogicalGroupSearchResult>>();
foreach (var group in searchGroups)
{
tasks.Add(validatorService.ValidateLogicGroupRecursivlyAsync(
propertySearchQuery, group.StandardPropertyLogicalGroup));
}
await Task.WhenAll(tasks);
results = tasks.Select(logicalGroupResultTask =>
new StandardSearchResult(logicalGroupResultTask.Result)).ToList();

The difference between parallel and async is this:
Parallel: Spin up multiple threads and divide the work over each thread
Async: Do the work in a non-blocking manner.
Whether this makes a difference depends on what it is that is blocking in the async-way. If you're doing work on the CPU, it's the CPU that is blocking you and therefore you will still end up with multiple threads. In case it's IO (or anything else besides the CPU, you will reuse the same thread)
For your particular example that means the following:
Parallel.ForEach => Spin up new threads for each item in the list (the nr of threads that are spun up is managed by the CLR) and execute each item on a different thread
async/await => Do this bit of work, but let me continue execution. Since you have many items, that means saying this multiple times. It depends now what the results:
If this bit of workis on the CPU, the effect is the same
Otherwise, you'll just use a single thread while the work is being done somewhere else

Mongodb C# driver: should I use both FindAysnc and ToListAsync in the same query [duplicate]

I am writing a very, very simple query which just gets a document from a collection according to its unique Id. After some frusteration (I am new to mongo and the async / await programming model), I figured this out:
IMongoCollection<TModel> collection = // ...
FindOptions<TModel> options = new FindOptions<TModel> { Limit = 1 };
IAsyncCursor<TModel> task = await collection.FindAsync(x => x.Id.Equals(id), options);
List<TModel> list = await task.ToListAsync();
TModel result = list.FirstOrDefault();
return result;
It works, great! But I keep seeing references to a "Find" method, and I worked this out:
IMongoCollection<TModel> collection = // ...
IFindFluent<TModel, TModel> findFluent = collection.Find(x => x.Id == id);
findFluent = findFluent.Limit(1);
TModel result = await findFluent.FirstOrDefaultAsync();
return result;
As it turns out, this too works, great!
I'm sure that there's some important reason that we have two different ways to achieve these results. What is the difference between these methodologies, and why should I choose one over the other?

The difference is in a syntax.
Find and FindAsync both allows to build asynchronous query with the same performance, only
FindAsync returns cursor which doesn't load all documents at once and provides you interface to retrieve documents one by one from DB cursor. It's helpful in case when query result is huge.
Find provides you more simple syntax through method ToListAsync where it inside retrieves documents from cursor and returns all documents at once.

Imagine that you execute this code in a web request, with invoking find method the thread of the request will be frozen until the database return results it's a synchron call, if it's a long database operation that takes seconds to complete, you will have one of the threads available to serve web request doing nothing simply waiting that database return the results, and wasting valuable resources (the number of threads in thread pool is limited).
With FindAsync, the thread of your web request will be free while is waiting the database for returning the results, this means that during the database call this thread is free to attend an another web request. When the database returns the result then the code continue execution.
For long operations like read/writes from file system, database operations, comunicate with another services, it's a good idea to use async calls. Because while you are waiting for the results, the threads are available for serve another web request. This is more scalable.
Take a look to this microsoft article https://msdn.microsoft.com/en-us/magazine/dn802603.aspx.

Observable.Range being repeated?

New to Rx -- I have a sequence that appears to be functioning correctly except for the fact that it appears to repeat.
I think I'm missing something around calls to Select() or SelectMany() that triggers the range to re-evaluate.
Explanation of Code & What I'm trying to Do
For all numbers, loop through a method that retrieves data (paged from a database).
Eventually, this data will be empty (I only want to keep processing while it retrieves data
For each of those records retrieved, I only want to process ones that should be processed
Of those that should be processed, I'd like to process up to x of them in parallel (according to a setting).
I want to wait until the entire sequence is completed to exit the method (hence the wait call at the end).
Problem With the Code Below
I run the code through with a data set that I know only has 1 item.
So, page 0 returns 1 item, and page 1 return 0 items.
My expectation is that the process runs once for the one item.
However, I see that both page 0 and 1 are called twice and the process thus runs twice.
I think this has something to do with a call that is causing the range to re-evaluate beginning from 0, but I can't figure out what that it is.
The Code
var query = Observable.Range(0, int.MaxValue)
.Select(pageNum =>
{
_etlLogger.Info("Calling GetResProfIDsToProcess with pageNum of {0}", pageNum);
return _recordsToProcessRetriever.GetResProfIDsToProcess(pageNum, _processorSettings.BatchSize);
})
.TakeWhile(resProfList => resProfList.Any())
.SelectMany(records => records.Where(x=> _determiner.ShouldProcess(x)))
.Select(resProf => Observable.Start(async () => await _schoolDataProcessor.ProcessSchoolsAsync(resProf)))
.Merge(maxConcurrent: _processorSettings.ParallelProperties)
.Do(async trackingRequests =>
{
await CreateRequests(trackingRequests.Result, createTrackingPayload);
var numberOfAttachments = SumOfRequestType(trackingRequests.Result, TrackingRecordRequestType.AttachSchool);
var numberOfDetachments = SumOfRequestType(trackingRequests.Result, TrackingRecordRequestType.DetachSchool);
var numberOfAssignmentTypeUpdates = SumOfRequestType(trackingRequests.Result,
TrackingRecordRequestType.UpdateAssignmentType);
_etlLogger.Info("Extractor generated {0} attachments, {1} detachments, and {2} assignment type changes.",
numberOfAttachments, numberOfDetachments, numberOfAssignmentTypeUpdates);
});
var subscription = query.Subscribe(
trackingRequests =>
{
//Nothing really needs to happen here. Technically we're just doing something when it's done.
},
() =>
{
_etlLogger.Info("Finished! Woohoo!");
});
await query.Wait();

This is because you subscribe to the sequence twice. Once at query.Subscribe(...) and again at query.Wait().
Observable.Range(0, int.MaxValue) is a cold observable. Every time you subscribe to it, it will be evaluated again. You could make the observable hot by publishing it with Publish(), then subscribe to it, and then Connect() and then Wait(). This does add a risk to get a InvalidOperationException if you call Wait() after the last element is already yielded. A better alternative is LastOrDefaultAsync().
That would get you something like this:
var connectable = query.Publish();
var subscription = connectable.Subscribe(...);
subscription = new CompositeDisposable(connectable.Connect(), subscription);
await connectable.LastOrDefaultAsync();
Or you can avoid await and return a task directly with ToTask() (do remove async from your method signature).
return connectable.LastOrDefaultAsync().ToTask();
Once converted to a task, you can synchronously wait for it with Wait() (do not confuse Task.Wait() with Observable.Wait()).
connectable.LastOrDefaultAsync().ToTask().Wait();
However, most likely you do not want to wait at all! Waiting in a async context makes little sense. What you should do it put the remaining of the code that needs to run after the sequence completes in the OnComplete() part of the subscription. If you have (clean-up) code that needs to run even when you unsubscribe (Dispose), consider Observable.Using or the Finally(...) method to ensure this code is ran.

As already mentioned the cause of the Observable.Range being repeated is the fact that you're subscribing twice - once with .Subscribe(...) and once with .Wait().
In this kind of circumstance I would go with a very simple blocking call to get the values. Just do this:
var results = query.ToArray().Wait();
The .ToArray() turns a multi-valued IObservable<T> into a single values IObservable<T[]>. The .Wait() turns this into T[]. It's the easy way to ensure only one subscription, blocking, and getting all of the values out.
In your case you may not need all values, but I think this is a good habit to get into.

Is it a bad practice to combine use of Task and IObservable in my C# application?

I've recently gotten into Rx and I'm using it to help me pull data from several APIs in a data mining application.
I have an interface that I implement for each API, which encapsulates common calls to each API, e.g.
public interface IMyApi {
IObservable<string> GetApiName(); //Cold feed for getting the API's name.
IObservable<int> GetNumberFeed(); //Hot feed of numbers from the API
}
My question is around cold IObservables vs Tasks. In my mind, a cold observable is basically a task, they operate in much the same way. It strikes me as strange to be 'abstracting' a Task away as a cold observable, when you could argue that a Task is all you need. Also using a cold observable to wrap Tasks hides the nature of the activity, since the signature looks the same as a hot observable.
Another way I could represent the above interface is:
public interface IMyApi {
Task<string> GetApiNameAsync(); //Async method for getting the API's name.
IObservable<int> GetNumberFeed(); //Hot feed of numbers from the API
}
Is there some conventional wisdom on why I shouldn't mix and match between Tasks and IObservables?
Edit: To clarify - I've read the other discussions posted and understand the relationship between Rx and TPL, but my concerns are mainly about whether or not it's safe to combine the two in an application and whether it can lead to bad practice or threading and scheduling pitfalls?

My question is around cold IObservables vs Tasks. In my mind, a cold observable is basically a task, they operate in much the same way
It is important to note that this is not the case, they are very different. Here's the core difference:
// Nothing happens here at all! Just like calling Enumerable.Range(0, 100000000)
// doesn't actually create a huge array, until I use foreach.
var myColdObservable = MakeANetworkRequestObservable();
// Two network requests made!
myColdObservable.Subscribe(x => /*...*/);
myColdObservable.Subscribe(x => /*...*/);
// Only ***one*** network request made, subscribers share the
// result
var myTaskObservable = MakeATask().ToObservable();
myTaskObservable.Subscribe(x => /*...*/);
myTaskObservable.Subscribe(x => /*...*/);
Why is this important? Several methods in Rx such as Retry depend on this behavior:
// Retries three times, then gives up
myColdObservable.Retry(3).Subscribe(x => /*...*/);
// Actually *never* retries, and is effectively the same as if the
// Retry were never there, since all three tries will get the same
// result!
myTaskObservable.Retry(3).Subscribe(x => /*...*/);
So in general, making your Observables as cold will generally make your life easier.
How can I make a Task Cold?
Use the Defer operator:
var obs = Observable.Defer(() => CreateATask().ToObservable());
// CreateATask called *twice* here
obs.Subscribe(/*...*/);
obs.Subscribe(/*...*/);

There's no problem mixing the models, and in fact even the Rx team has included many adaptive operators in Rx. For example, ToTask, ToObservable, SelectMany, DeferAsync, StartAsync, ToAsync, etc. You can even await an IObservable<T> within an async method.
The primary difference that should affect your decision is cardinality:
IObservable<T> is [0,∞]
Task<T> is [0,1]
So if you need to represent only a single return value, then strongly consider using Task<T>.

The difference between Task and IObservable is not hot vs. cold: Task-returning methods can pretty much be "cold" (return new Task on every call) or "hot" (always return the same Task), just like IObservables.
The difference between the two is that IObservable represents a sequence of results, while Task represents a single result.
So, in cases when you'll always have a single result (or an error), use Task, when you can have any number of results, use IObservable.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Converting a IEnumerable<T> to IObservable<T>, with maximum parallelism - c#

Related

System.Reactive Throttling an async method

Parallel or async ASP.NET Core C#

Mongodb C# driver: should I use both FindAysnc and ToListAsync in the same query [duplicate]

Observable.Range being repeated?

Is it a bad practice to combine use of Task and IObservable in my C# application?

Categories

Resources