Problem:
IEnumerable<Signal> feed = GetFeed();
var average1 = feed.MovingAverage(10);
var average2 = feed.MovingAverage(20);
var zipped = average1.Zip(average2, (x, y) => Tuple.Create(x, y));
When I iterate through "zipped", GetFeed().GetEnumerator() gets called twice and creates all sorts of synchronization issues. Is there a LINQ operator that can be used to broadcast values from a single producer to multiple consumers? I know about Memoize, but in my case I can't predict a buffer size that keeps both slow and fast consumers "happy".
I am thinking about writing my own operator that would keep separate queues for each consumer, but wanted to check if there is an existing solution.
What you want is Reactive Extensions. It's like LINQ to Objects, but in reverse: you don't pull values, they're pushed through observers.
It takes a little while to get used to it, but judging by what you've posted, it's exactly the right model for you.
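For a flavor of what that looks like, here is a minimal sketch assuming the System.Reactive package; MovingAverage is a hypothetical IObservable-based port of your operator, and Publish is what gives you the single-producer broadcast you asked for:
using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// Publish gives a connectable observable: one shared subscription to the
// source, so GetFeed() is only enumerated once.
IConnectableObservable<Signal> feed = GetFeed().ToObservable().Publish();

var average1 = feed.MovingAverage(10);  // hypothetical IObservable<T> port
var average2 = feed.MovingAverage(20);  // of your MovingAverage operator

var zipped = average1.Zip(average2, (x, y) => Tuple.Create(x, y));
var subscription = zipped.Subscribe(pair => Console.WriteLine(pair));

// Start pumping values only after all consumers are wired up.
feed.Connect();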
I've watched the chat on LINQ with IAsyncEnumerable, which gave me some insight into writing extension methods for IAsyncEnumerables, but frankly it wasn't detailed enough for a real-world application, especially at my experience level, and I understand that samples/documentation don't really exist yet for IAsyncEnumerables.
I'm trying to read from a file, do some transformation on the stream returning an IAsyncEnumerable, and then send those objects downstream after an arbitrary number of objects have been obtained, like:
await foreach (var data in ProcessBlob(downloadedFile))
{
    // TODO: add data to a List<T> called listWithPreConfiguredNumberOfElements
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements);
    // Repeat until all the elements in the IAsyncEnumerable returned by
    // ProcessBlob have been sent downstream to the _messageHandler.
}
My understanding from reading on the matter so far is that the await foreach line works on data that employs Tasks (or ValueTasks), so we don't have a count up front. I'm also hesitant to use a List variable and just do a length check on it, as sharing that data across threads doesn't seem very thread-safe.
I'm using the System.Linq.Async package in the hope that I can use a relevant extension method. I can see some promise in the form of TakeWhile, but I'm not sure how thread-safe what I intend to do really is, which makes me lose confidence.
Any help or push in the right direction would be massively appreciated, thank you.
There is an operator Buffer that does what you want, in the package System.Interactive.Async.
// Projects each element of an async-enumerable sequence into consecutive
// non-overlapping buffers which are produced based on element count information.
public static IAsyncEnumerable<IList<TSource>> Buffer<TSource>(
this IAsyncEnumerable<TSource> source, int count);
This package contains operators like Amb, Throw, Catch, Defer, Finally etc. that do not have a direct equivalent in LINQ, but do have an equivalent in System.Reactive. This is because IAsyncEnumerables are conceptually closer to IObservables than to IEnumerables (both have a time dimension, while IEnumerables are timeless).
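A minimal usage sketch for the scenario in the question (assuming _messageHandler.Handle can accept the IList<T> batches that Buffer yields, or that you adapt them with .ToList()):
// Emits lists of preConfiguredNumber elements; the final list may be smaller.
await foreach (var batch in ProcessBlob(downloadedFile).Buffer(preConfiguredNumber))
{
    await _messageHandler.Handle(batch);
}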
I'm also hesitant to use a List variable and just do a length-check on that as sharing that data across threads doesn't seem very thread-safe.
You need to think in terms of execution flows, not threads, when dealing with async. Since you are await-ing the processing step, there isn't actually a concurrency problem accessing the list, because regardless of which threads are used, the list is only ever accessed by one execution flow at a time.
If you are still concerned, you could new up a list per batch, but that is probably overkill. What you do need, however, are two additions: a reset between batches, and a final processing step:
var listWithPreConfiguredNumberOfElements = new List<YourType>(preConfiguredNumber);
await foreach (var data in ProcessBlob(downloadedFile)) // CAF?
{
    listWithPreConfiguredNumberOfElements.Add(data);
    if (listWithPreConfiguredNumberOfElements.Count == preConfiguredNumber)
    {
        await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
        listWithPreConfiguredNumberOfElements.Clear(); // reset for a new batch
        // (replace this with a "new" if you're still concerned about concurrency)
    }
}
if (listWithPreConfiguredNumberOfElements.Any())
{   // process any stragglers
    await _messageHandler.Handle(listWithPreConfiguredNumberOfElements); // CAF?
}
You might also choose to use ConfigureAwait(false) in the three spots marked // CAF?
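For reference, a sketch of what that looks like; note that for the await foreach itself, ConfigureAwait(false) is applied to the enumerable (via TaskAsyncEnumerableExtensions), while the handler call takes it on the awaited task (assuming Handle returns a Task or ValueTask):
await foreach (var data in ProcessBlob(downloadedFile).ConfigureAwait(false))
{
    // ...
    await _messageHandler.Handle(listWithPreConfiguredNumberOfElements).ConfigureAwait(false);
}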
I just started messing around with reactive programming, and I know just enough to write code but not enough to figure out what's happening when I don't get what I expect. I don't really have a mentor available other than blog posts. I haven't found a very good solution to a situation I'm having, and I'm curious about the right approach.
The problem:
I need to get a Foo, which is partially composed of an array of Bar objects. I fetch the Bar objects from web services. So I represented each web service call as an IObservable from which I expect 0 or 1 elements before completion. I want to make an IObservable that will:
Subscribe to each of the IObservable instances.
Wait for up to a 2 second Timeout.
When either both sequences complete or the timeout happens:
Create an array with any of the Bar objects that were generated (there may be 0.)
Produce the Foo object using that Bar[].
I sort of accomplished this with this bit of code:
public Foo CreateFoo() {
    var producer1 = webService.BarGenerator()
        .Timeout(TimeSpan.FromSeconds(2), Observable.Empty<Bar>());
    var producer2 = // similar to above
    var pipe = producer1.Concat(producer2);
    Bar[] result = pipe.ToEnumerable().ToArray();
    ...
}
That doesn't seem right, for a lot of reasons. The most obvious is that Concat() will start the sequences serially rather than in parallel, so the worst case is a 4-second wait rather than 2. I don't really care that it blocks; it's actually convenient for the architecture I'm working with that it does. I'm fine with this method becoming a generator of IObservable<Foo>, but there are a few extra caveats here that seem to make that challenging when I try:
I need the final array to put producer1 and producer2's result in that order, if they both produce a result.
I'd like to use a TestScheduler to verify the timeout but haven't succeeded at that yet, I apparently don't understand schedulers at all.
This is, ultimately, a pull model, whatever gets the Foo needs it at a distinct point and there's no value to receiving it 'on the fly'. Maybe this tilts the answer to "Don't use Rx". To be honest, I got stuck enough I switched to a Task-based API. But I want to see how one might approach this with Rx, because I want to learn.
var pipe = producer1
    .Merge(producer2)
    .Buffer(Observable.Timer(TimeSpan.FromSeconds(2), testScheduler))
    .Take(1);
var subscription = pipe
    .Select(list => new Foo(list.ToArray()))
    .Subscribe(foo => { /* Do whatever you want with your foo here. */ });
Buffer collects all elements emitted during a window (in our case, two seconds) and emits them as a single list.
If you want to stick with your pull model, instead of a subscription you could do:
var list = await pipe;
var foo = new Foo(list.ToArray());
//....
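On the TestScheduler question: here is a minimal sketch of how the timeout could be verified with Microsoft.Reactive.Testing (the producers are stand-ins built from testable cold observables, with string used in place of Bar):
using System;
using System.Reactive.Linq;
using Microsoft.Reactive.Testing;

var scheduler = new TestScheduler();

// producer1 emits once after 1s; producer2 never emits,
// so only the 2-second timer closes the buffer window.
var producer1 = scheduler.CreateColdObservable(
    ReactiveTest.OnNext(TimeSpan.FromSeconds(1).Ticks, "bar1"),
    ReactiveTest.OnCompleted<string>(TimeSpan.FromSeconds(1).Ticks));
var producer2 = scheduler.CreateColdObservable<string>();

var pipe = producer1
    .Merge(producer2)
    .Buffer(Observable.Timer(TimeSpan.FromSeconds(2), scheduler))
    .Take(1);

// Start runs the virtual clock; no real time passes.
var observer = scheduler.Start(() => pipe,
    created: 0, subscribed: 0, disposed: TimeSpan.FromSeconds(10).Ticks);

// observer.Messages now holds an OnNext with the list ["bar1"] at the
// 2-second mark, followed by OnCompleted.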
I am wondering whether a foreach loop runs more slowly when an unstored query is used as its source instead of a stored array or List.
I mean like this:
foreach (var tour in list.OrderBy(x => x.Value))
{
    // DoSomething();
}
Does the loop in this code recalculate the sort on every iteration, or not?
The loop using a stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (var tour in list)
{
    // DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, the better first instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (var tour in list)
{
    // DoSomething();
}
vs this option:
foreach (var tour in tours.OrderBy(x => x.Value))
{
    // DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the documentation, you'll see it returns an IOrderedEnumerable<TSource>. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is finished by the time you start iterating over the object (and that, I believe, is the crux of your question: no, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same ordering work to do, and they accomplish it the same way, meaning they take the same amount of time to reach that point in the code.
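To make the deferred behavior concrete, here is a small self-contained sketch (Tour is a stand-in type matching the question):
using System;
using System.Collections.Generic;
using System.Linq;

class Tour { public int Value { get; set; } }

class Program
{
    static void Main()
    {
        var tours = new List<Tour> { new Tour { Value = 3 }, new Tour { Value = 1 } };

        var query = tours.OrderBy(x => x.Value); // nothing is sorted yet

        tours.Add(new Tour { Value = 2 });       // still visible to the query

        foreach (var tour in query)              // the sort runs once, right here
            Console.WriteLine(tour.Value);       // prints 1, 2, 3
    }
}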
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code that must execute in order to produce the list; that code effectively runs its own foreach loop that you can't see. Additionally, where once you only needed enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also need new memory allocations for the full objects, if you were reading from a stream or database reader that otherwise only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes, the foreach statement itself will appear to work slower, because the sorting happens when the iteration begins.
No, your program has the same total amount of work to do either way, so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray(). In this case you are only using it once (the foreach), but it is an easy thing to miss.
Edit: just to be clear, the as cast in the question will not work as intended, but my answer assumes there is no .ToList() after the OrderBy.
This line won't run:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate much faster on any subsequent iteration, since the sorting has already been done and is not repeated.
This is an algorithmic question.
I have a Dictionary<object, Queue<object>>. Each queue contains one or more elements. I want to remove all queues with only one element from the dictionary. What is the fastest way to do it?
Pseudo-code: foreach(item in dict) if(item.Length==1) dict.Remove(item);
It is easy to do it in a loop (not foreach, of course), but I'd like to know which approach is the fastest one here.
Why I want it: I use that dictionary to find duplicate elements in a large set of objects. The Key in dictionary is kind of a hash of the object, the Value is a queue of all objects found with the same hash. Since I want only duplicates, I need to remove all items with just a single object in associated queue.
Update:
It may be important to know that in the regular case there are just a few duplicates in a large set of objects. Let's assume 1% or less. So it could possibly be faster to leave the dictionary as is and create a new one from scratch with just the selected elements from the first one, and then delete the first dictionary completely. I think it depends on the complexity of the Dictionary class's methods used in the particular algorithms.
I really want to see this problem on a theoretical level because as a teacher I want to discuss it with students. I didn't provide any concrete solution myself because I think it is really easy to do it. The question is which approach is the best, the fastest.
// Materialize the keys first; you cannot remove entries from a
// dictionary while enumerating it directly.
var itemsWithOneEntry = dict.Where(x => x.Value.Count == 1)
                            .Select(x => x.Key)
                            .ToList();
foreach (var item in itemsWithOneEntry)
{
    dict.Remove(item);
}
Instead of trying to optimize the traversal of the collection, how about optimizing its content so that it only ever contains the duplicates? That would require changing your collection-building algorithm to something like this:
var duplicates = new Dictionary<object, Queue<object>>();
var possibleDuplicates = new Dictionary<object, object>();
foreach (var item in original)
{
    if (possibleDuplicates.ContainsKey(item))
    {
        // Second sighting: promote to the duplicates dictionary.
        duplicates.Add(item, new Queue<object>(new[] { possibleDuplicates[item], item }));
        possibleDuplicates.Remove(item);
    }
    else if (duplicates.ContainsKey(item))
    {
        duplicates[item].Enqueue(item);
    }
    else
    {
        // First sighting: remember it in case a duplicate shows up later.
        possibleDuplicates.Add(item, item);
    }
}
Note that you should probably measure the impact of this on the performance in a realistic scenario before you bother to make your code any more complex than it really needs to be. Most imagined performance problems are not in fact the real cause of slow code.
But supposing you do find that you could get a speed advantage by avoiding a linear search for queues of length 1, you could solve this problem with a technique called indexing.
As well as your dictionary containing all the queues, you maintain an index container (probably another dictionary) that only contains the queues of length 1, so when you need them they are already available separately.
To do this, you need to enhance all the operations that modify the length of the queue, so that they have the side-effect of updating the index container.
One way to do it is to define a class ObservableQueue. This would be a thin wrapper around Queue except it also has a ContentsChanged event that fires when the number of items in the queue changes. Use ObservableQueue everywhere instead of the plain Queue.
Then when you create a new queue, enlist on its ContentsChanged event a handler that checks to see if the queue only has one item. Based on this you can either insert or remove it from the index container.
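A minimal sketch of that idea (the class and event names are hypothetical, and only the operations needed for illustration are shown):
using System;
using System.Collections.Generic;

public class ObservableQueue<T>
{
    private readonly Queue<T> inner = new Queue<T>();

    // Fires after any operation that changes the number of items.
    public event Action<ObservableQueue<T>> ContentsChanged;

    public int Count => inner.Count;

    public void Enqueue(T item)
    {
        inner.Enqueue(item);
        ContentsChanged?.Invoke(this);
    }

    public T Dequeue()
    {
        var item = inner.Dequeue();
        ContentsChanged?.Invoke(this);
        return item;
    }
}

// Usage: keep an index of single-element queues as a side effect.
// var singles = new HashSet<ObservableQueue<object>>();
// queue.ContentsChanged += q => { if (q.Count == 1) singles.Add(q); else singles.Remove(q); };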
I'm having a hard time with parts of my code:
private void UpdateOutputBuffer()
{
    T[] OutputField = new T[DisplayedLength];
    int temp = 0;
    int Count = HistoryQueue.Count;
    int Sample = 0;
    // Then fill the useful part with samples from the queue
    for (temp = DisplayStart; temp != DisplayStart + DisplayedLength && temp < Count; temp++)
    {
        OutputField[Sample++] = HistoryQueue.ElementAt(Count - temp - 1);
    }
    DisplayedHistory = OutputField;
}
It takes most of the time in the program. The number of elements in HistoryQueue is 200k+. Could this be because the queue in .NET is implemented internally as a linked list?
What would be a better way of going about this? Basically, the class should act like a FIFO that starts dropping elements at ~500k samples and I could pick DisplayedLength elements and put them into OutputField. I was thinking of writing my own Queue that would use a circular buffer.
The code worked fine for lower values of Count. DisplayedLength is 500.
Thank you,
David
Queue<T> does not have an ElementAt method. I'm guessing you are getting this via LINQ, and that it is simply doing a forced iteration over n elements until it reaches the desired index. This obviously slows down as the collection gets bigger. If ElementAt represents a common access pattern, then pick a data structure that can be accessed by index, e.g. an array.
Yes, the lack of random access is almost certainly the problem: LINQ's ElementAt has to walk the queue one element at a time. There's a reason why Queue<T> doesn't implement IList<T> :) (Having said that, I believe Queue<T> and Stack<T> are both implemented using arrays internally, and still don't implement IList<T>. They could provide efficient random access, but they don't.)
I can't easily tell which portion of the queue you're trying to display, but I strongly suspect that you could simplify the method and make it more efficient using something like:
T[] outputField = HistoryQueue.Skip(...) /* adjust to suit requirements... */
.Take(DisplayedLength)
.Reverse()
.ToArray();
That's still going to have to skip over a huge number of items individually, but at least it will only have to do it once.
Have you thought of using a LinkedList<T> directly? That would make it a lot easier to read items from the end of the list really easily.
Building your own bounded queue using a circular buffer wouldn't be hard, of course, and may well be the better solution in the long run.
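For instance, here is a minimal sketch of such a bounded circular-buffer queue (hypothetical names, no bounds checking or interface implementations):
public class BoundedQueue<T>
{
    private readonly T[] buffer;
    private int head;   // index of the oldest element
    public int Count { get; private set; }

    public BoundedQueue(int capacity) => buffer = new T[capacity];

    public void Enqueue(T item)
    {
        buffer[(head + Count) % buffer.Length] = item;
        if (Count == buffer.Length)
            head = (head + 1) % buffer.Length; // full: overwrite the oldest element
        else
            Count++;
    }

    // O(1) access by logical index; 0 is the oldest element.
    public T this[int index] => buffer[(head + index) % buffer.Length];
}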
Absolutely the wrong data structure to use here. ElementAt is O(n), which makes your loop O(n²). You should use something else instead of a Queue.
Personally I don't think a queue is what you're looking for, but your access pattern is even worse. Use iterators if you want sequential access:
foreach (var h in HistoryQueue.Skip(DisplayStart).Take(DisplayedLength).Reverse())
// work with h
If you need to be able to pop/push at either end and have indexed access, you really need an implementation of a deque (in its multiple-array form). While there is no implementation in the BCL, there are plenty of third-party ones to get started with; if needed, you could implement your own later.