Yield multiple IEnumerables - c#

I have a piece of code that does calculations on assets. There are many millions of those, so I want to compute everything in streams. My current 'pipeline' looks like this:
I have a query that is executed as a DataReader.
Then my Asset class has a constructor that accepts an IDataReader:
public Asset(IDataReader rdr) {
// logic that initiates fields
}
and a method that converts the IDataReader to an IEnumerable<Asset>
public static IEnumerable<Asset> ToAssets(IDataReader rdr) {
// make sure the reader is in the right format
CheckReaderFormat(rdr);
// project reader into IEnumerable<Asset>
while (rdr.Read()) yield return new Asset(rdr);
}
That then gets passed into a function that does the actual calculations and then projects it into an IEnumerable<Answer>.
That then gets a wrapper that exposes the Answers as an IDataReader, and that in turn gets passed to an OracleBulkCopy and the stream is written to the DB.
So far it works like a charm. Because of the setup I can swap the DataReader for an IEnumerable that reads from a file, or have the results written to a file etc. All depending on how I string the classes/ functions together.
Now: there are several things I can compute. For instance, besides the normal Answer I could have a DebugAnswer class that also outputs some intermediate numbers for debugging. So what I would like to do is project the IEnumerable into several output streams so I can put 'listeners' on those. That way I won't have to go over the data multiple times. How can I do that? Kind of like having several events and then only firing certain code if there's a listener attached.
Also, sometimes I write to the DB but also to a zip file, just to keep a backup of the results. So then I would like to have 2 'listeners' on the IEnumerable: one that projects it as an IDataReader and another one that writes straight to the file.
How do I output multiple output streams and how can I put multiple listeners on one outputstream? What lets me compose streams of data like that?
edit
so some pseudocode of what I would like to do:
foreach(Asset in Assets){
if(DebugListener != null){
// compute
DebugAnswer da = new DebugAnswer {result = 100};
yield da to DebugListener; // so instead of yield return yield to that stream
}
if(AnswerListener != null){
// compute basic stuff
Answer a = new Answer { bla = 200 };
yield a to AnswerListener;
}
}
Thanks in advance,
Gert-Jan

What you're describing sounds sort of like what the Reactive framework provides via the IObservable interface, but I don't know for sure whether it allows multiple subscribers to a single subscription stream.
Update
If you take a look at the documentation for IObservable, it has a pretty good example of how to do the sort of thing you're doing, with multiple subscribers to a single object.
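To make the "multiple subscribers on one stream" idea concrete, here is a minimal sketch that uses only the System.IObservable<T> and IObserver<T> interfaces, without the Rx library. Broadcaster<T> and CollectingObserver<T> are hypothetical names invented for this illustration, and the sketch is not thread-safe; Rx provides more complete equivalents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a hand-rolled broadcaster built only on the
// System.IObservable<T>/IObserver<T> interfaces (no Rx library required).
public sealed class Broadcaster<T> : IObservable<T>
{
    private readonly List<IObserver<T>> _observers = new List<IObserver<T>>();

    // Register a listener; disposing the returned handle unsubscribes it.
    public IDisposable Subscribe(IObserver<T> observer)
    {
        _observers.Add(observer);
        return new Unsubscriber(_observers, observer);
    }

    // True when at least one listener is attached, so a producer can skip
    // computing values nobody is listening for.
    public bool HasObservers { get { return _observers.Count > 0; } }

    // Push one item to every attached listener.
    public void Publish(T value)
    {
        foreach (var o in _observers.ToArray()) o.OnNext(value);
    }

    // Signal the end of the stream to every listener.
    public void Complete()
    {
        foreach (var o in _observers.ToArray()) o.OnCompleted();
        _observers.Clear();
    }

    private sealed class Unsubscriber : IDisposable
    {
        private readonly List<IObserver<T>> _list;
        private readonly IObserver<T> _observer;
        public Unsubscriber(List<IObserver<T>> list, IObserver<T> observer)
        {
            _list = list;
            _observer = observer;
        }
        public void Dispose() { _list.Remove(_observer); }
    }
}

// Hypothetical listener that simply collects what it receives.
public sealed class CollectingObserver<T> : IObserver<T>
{
    public readonly List<T> Items = new List<T>();
    public bool Completed { get; private set; }
    public void OnNext(T value) { Items.Add(value); }
    public void OnError(Exception error) { }
    public void OnCompleted() { Completed = true; }
}
```

In the asker's loop, one such broadcaster per output type (one for Answer, one for DebugAnswer) could be published to inside the foreach, checking HasObservers first so the debug values are only computed when something is subscribed.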

Your example rewritten using Rx:
// The stream of assets
IObservable<Asset> assets = ...
// The stream of each asset projected to a DebugAnswer
IObservable<DebugAnswer> debugAnswers = from asset in assets
select new DebugAnswer { result = 100 };
// Subscribe the DebugListener to receive the debugAnswers
debugAnswers.Subscribe(DebugListener);
// The stream of each asset projected to an Answer
IObservable<Answer> answers = from asset in assets
select new Answer { bla = 200 };
// Subscribe the AnswerListener to receive the answers
answers.Subscribe(AnswerListener);

This is exactly the job for Reactive Extensions (the IObservable<T> and IObserver<T> interfaces have been part of .NET since 4.0; Rx itself is a separate library, also available for 3.5).

You don't need multiple "listeners", you just need pipeline components that aren't destructive or even necessarily transformable.
IEnumerable<T> PassThroughEnumerable<T>(IEnumerable<T> source, Action<T> action) {
foreach (T t in source) {
action(t);
yield return t;
}
}
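As a sketch of how such a pass-through component composes, the example below chains two of them so that two "listeners" see every item during a single enumeration. Tap and Demo are hypothetical names, and the two lists stand in for the zip-file writer and the debug output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PipelineTaps
{
    // Same idea as PassThroughEnumerable above: run a side effect per item
    // and hand the item on unchanged. Evaluation stays lazy and single-pass.
    public static IEnumerable<T> Tap<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (T item in source)
        {
            action(item);
            yield return item;
        }
    }

    // Hypothetical demo: two "listeners" see every item in one enumeration.
    public static int Demo(List<int> backupLog, List<int> debugLog)
    {
        IEnumerable<int> answers = Enumerable.Range(1, 3)
            .Tap(backupLog.Add)   // stand-in for "write to zip file"
            .Tap(debugLog.Add);   // stand-in for "write debug output"
        // Nothing has run yet; the final consumer drives the whole pipeline.
        return answers.Sum();     // stand-in for the bulk-copy consumer
    }
}
```

Because each tap only observes and forwards, the source is still read exactly once, and taps can be added or removed when the pipeline is assembled.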
Or, as you're processing in the pipeline, just raise some events to be consumed. You can raise them asynchronously if you want:
static IEnumerable<Asset> ToAssets(IDataReader rdr) {
CheckReaderFormat(rdr);
// the method is static, so the event must be static too (no 'this')
var h = DebugAsset;
while (rdr.Read()) {
var a = new Asset(rdr);
if (h != null) h(a);
yield return a;
}
}
// Action<Asset> matches the h(a) call; EventHandler<Asset> would need a sender argument
public static event Action<Asset> DebugAsset;

If I got you right, it should be possible to replace or decorate the wrapper. The WrapperDecorator may forward calls to the normal OracleBulkCopy (or whatever you're using) and add some custom debug code.
Does that help you?
Matthias

Related

Is there a way to avoid using side effects to process this data

I have an application I'm writing that runs script plugins to automate what a user used to have to do manually through a serial terminal. So, I am basically implementing the serial terminal's functionality in code. One of the functions of the terminal was to send a command which kicked off the terminal receiving continuously streamed data from a device until the user pressed space bar, which would then stop the streaming of the data. While the data was streaming, the user would then set some values in another application on some other devices and watch the data streamed in the terminal change.
Now, the streamed data can take different shapes, depending on the particular command that's sent. For instance, one response may look like:
---RESPONSE HEADER---
HERE: 1
ARE: 2 SOME:3
VALUES: 4
---RESPONSE HEADER---
HERE: 5
ARE: 6 SOME:7
VALUES: 8
....
another may look like:
here are some values
in cols and rows
....
So, my idea is to have a different parser based on the command I send. So, I have done the following:
public class Terminal
{
private SerialPort port;
private IResponseHandler pollingResponseHandler;
private object locker = new object();
private List<Response1Clazz> response1;
private List<Response2Clazz> response2;
//setter omitted for brevity
//get snapshot of data at any point in time while response is polling.
public List<Response1Clazz> Response1 { get { lock (locker) return new List<Response1Clazz>(response1); } }
//setter omitted for brevity
public List<Response2Clazz> Response2 { get { lock (locker) return new List<Response2Clazz>(response2); } }
public Terminal()
{
port = new SerialPort(){/*initialize data*/}; //open port etc etc
}
void StartResponse1Polling()
{
Response1 = new List<Response1Clazz>();
Parser<List<Response1Clazz>> parser = new KeyValueParser(Response1); //parser is of type T
pollingResponseHandler = new PollingResponseHandler(parser);
//write command to start polling response 1 in a task
}
void StartResponse2Polling()
{
Response2 = new List<Response2Clazz>();
Parser<List<Response2Clazz>> parser = new RowColumnParser(Response2); //parser is of type T
pollingResponseHandler = new PollingResponseHandler(parser); // this accepts a parser of type T
//write command to start polling response 2
}
void OnSerialDataReceived(object sender, Args a)
{
lock(locker){
//do some processing yada yada
//we pass in the serial data to the handler, which in turn delegates to the parser.
pollingResponseHandler.Handle(processedSerialData);
}
}
}
the caller of the class would then be something like
public class Plugin : BasePlugin
{
public override void PluginMain()
{
Terminal terminal = new Terminal();
terminal.StartResponse1Polling();
//update some other data;
List<Response1Clazz> response = terminal.Response1;
//process response
//update more data
response = terminal.Response1;
//process response
//terminal1.StopPolling();
}
}
My question is quite general, but I'm wondering if this is the best way to handle the situation. Right now I am required to pass in an object/List that I want modified, and it's modified via a side effect. For some reason this feels a little ugly because there is really no indication in code that this is what is happening. I am purely doing it because the "Start" method is the location that knows which parser to create and which data to update. Maybe this is Kosher, but I figured it is worth asking if there is another/better way. Or at least a better way to indicate that the "Handle" method produces side effects.
Thanks!
I don't see problems in modifying List<>s that are received as a parameter. It isn't the most beautiful thing in the world but it is quite common. Sadly C# doesn't have a const modifier for parameters (compare this with C/C++, where unless you declare a parameter to be const, it is ok for the method to modify it). You only have to give the parameter a self-explaining name (like outputList), and put a comment on the method (you know, an xml-comment block, like /// <param name="outputList">This list will receive...</param>).
To give a more complete response, I would need to see the whole code. You have omitted an example of Parser and an example of Handler.
I do, however, see a problem with your lock in { lock (locker) return new List<Response1Clazz>(response1); }. It also seems to make no sense, considering that you then do Response1 = new List<Response1Clazz>();, but Response1 only has a getter.

Executing part of code exactly 1 time inside Parallel.ForEach

I have to query my company's CRM solution (Oracle's Right Now) for our 600k users, and update them there if they exist or create them if they don't. To know whether a user already exists in Right Now, I consume a third-party WS. With 600k users this can be a real pain because of the time each response takes (around 1 second). So I changed my code to use Parallel.ForEach, querying each record in just 0.35 seconds and adding it to a List<User> of records to be created or to be updated (Right Now is kinda dumb, so I need to separate them into 2 lists and call 2 distinct WS methods).
My code used to run perfectly before multithread, but took too long. The problem is that I can't make a batch too large or I get a timeout when I try to update or create via Web Service. So I'm sending them around 500 records at once, and when it runs the critical code part, it executes many times.
Parallel.ForEach(boDS.USERS.AsEnumerable(), new ParallelOptions { MaxDegreeOfParallelism = -1 }, row =>
{
...
user = null;
user = QueryUserById(row["USER_ID"].Trim());
if (user == null)
{
isUpdate = false;
gObject.ID = new ID();
}
else
{
isUpdate = true;
gObject.ID = user.ID;
}
... fill user attributes as generic fields ...
gObject.GenericFields = listGenericFields.ToArray();
if (isUpdate)
listUserUpdate.Add(gObject);
else
listUserCreate.Add(gObject);
if (i == batchSize - 1 || i == (boDS.USERS.Rows.Count - 1))
{
UpdateProcessingOptions upo = new UpdateProcessingOptions();
CreateProcessingOptions cpo = new CreateProcessingOptions();
upo.SuppressExternalEvents = false;
upo.SuppressRules = false;
cpo.SuppressExternalEvents = false;
cpo.SuppressRules = false;
RNObject[] results = null;
// <Critical_code>
if (listUserCreate.Count > 0)
{
results = _service.Create(_clientInfoHeader, listUserCreate.ToArray(), cpo);
}
if (listUserUpdate.Count > 0)
{
_service.Update(_clientInfoHeader, listUserUpdate.ToArray(), upo);
}
// </Critical_code>
listUserUpdate = new List<RNObject>();
listUserCreate = new List<RNObject>();
}
i++;
});
I thought about using lock or mutex, but that isn't going to help me, since the other threads will just wait and execute afterwards. I need some solution to execute that part of the code only ONCE, in only ONE thread. Is it possible? Can anyone shed some light?
Thanks and kind regards,
Leandro
As you stated in the comments you're declaring the variables outside of the loop body. That's where your race conditions originate from.
Let's take variable listUserUpdate for example. It's accessed randomly by parallel executing threads. While one thread is still adding to it, e.g. in listUserUpdate.Add(gObject); another thread could already be resetting the lists in listUserUpdate = new List<RNObject>(); or enumerating it in listUserUpdate.ToArray().
You really need to refactor that code to
make each loop iteration as independent of the others as you can, by moving variables inside the loop body, and
access shared data in a synchronized way, using locks and/or concurrent collections.
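One way to apply both points is to split the input into batches first and run Parallel.ForEach over the batches, so every list is local to a single batch and each batch is sent exactly once. The following is only a sketch under that assumption: BatchProcessor, ToBatches and Run are hypothetical names, and the exists/create/update delegates stand in for the real web service calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchProcessor
{
    // Split the input into consecutive batches of at most batchSize items.
    public static List<List<T>> ToBatches<T>(IEnumerable<T> source, int batchSize)
    {
        return source
            .Select((item, index) => new { item, index })
            .GroupBy(x => x.index / batchSize)
            .Select(g => g.Select(x => x.item).ToList())
            .ToList();
    }

    // Process batches in parallel. Every variable touched inside the lambda
    // is local to that batch, so no locking is needed and each batch is sent once.
    // exists/create/update are hypothetical stand-ins for the web service calls.
    public static void Run<T>(IEnumerable<T> users, int batchSize,
        Func<T, bool> exists, Action<List<T>> create, Action<List<T>> update)
    {
        Parallel.ForEach(ToBatches(users, batchSize), batch =>
        {
            var toCreate = new List<T>();
            var toUpdate = new List<T>();
            foreach (T user in batch)
            {
                if (exists(user)) toUpdate.Add(user); else toCreate.Add(user);
            }
            if (toCreate.Count > 0) create(toCreate);
            if (toUpdate.Count > 0) update(toUpdate);
        });
    }
}
```

The trade-off in this sketch is that the lookups inside one batch run sequentially and the parallelism comes from processing several batches at once; the delegates passed in must themselves be safe to call from multiple threads.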
You can use the Double-checked locking pattern. This is usually used for singletons, but you're not making a singleton here so generic singletons like Lazy<T> do not apply.
It works like this:
Separate out your shared data into some sort of class:
class QuerySharedData {
// All the write-once-read-many fields that need to be shared between threads
public QuerySharedData() {
// Compute all the write-once-read-many fields. Or use a static Create method if that's handy.
}
}
In your outer class add the following:
readonly object padlock = new object(); // must be non-null, or lock (padlock) throws
volatile QuerySharedData data;
In your thread's callback delegate, do this:
if (data == null)
{
lock (padlock)
{
if (data == null)
{
data = new QuerySharedData(); // this does all the work to initialize the shared fields
}
}
}
var localData = data;
Then use the shared query data from localData. By grouping the shared query data into a subordinate class you avoid the necessity of making its individual fields volatile.
More about volatile here: Part 4: Advanced Threading.
Update: my assumption here is that all the classes and fields held by QuerySharedData are read-only once initialized. If this is not true, for instance if you initialize a list once but add to it in many threads, this pattern will not work for you. You will have to consider using things like Thread-Safe Collections.

Windows 8 Store App and Linq

The snippet below is from a Windows 8 store app in c# and xaml.
I have put this code together from various samples on the web, so this may not be the neatest way of doing this. Most of it is from the Grid template supplied in VS2012, and I have hooked up my web API as the source of the data.
Please explain the following:
When I call the Get method, all works fine and I get data back into the XAML view.
When I uncomment the Take(10) in the same method, I get no data back.
It seems any attempt to add a LINQ extension method just stops the data being returned and also gives no indication why; it compiles fine!
Any help appreciated
Thanks
Mark
public class TeamDataSource
{
private static TeamDataSource _sampleDataSource = new TeamDataSource();
private ObservableCollection<TeamDataItem> _items = new ObservableCollection<TeamDataItem>();
public ObservableCollection<TeamDataItem> Items
{
get { return this._items; }
}
public TeamDataSource()
{
this.Initialize();
}
public static IEnumerable<TeamDataItem> Get()
{
var thisdata = _sampleDataSource.Items;
return thisdata;//.Take(10);
}
private async void Initialize()
{
using (var client = new DataServiceClient())
{
List<TeamDataItem> list = await client.Download<List<TeamDataItem>>("/teams");
foreach (var i in list.OrderByDescending(t => t.Points).ThenByDescending(t => t.GoalDiff))
{
TeamDataItem team = i;
_items.Add(team);
}
}
}
}
Your problem is that Take doesn't immediately enumerate the items. It defers enumeration until either foreach is called on it or GetEnumerator is called on it. In this case the collection it is enumerating is disposed (as soon as the Get content ends) and so when it finally enumerates the items, there are no items anymore. Try adding thisdata.GetEnumerator(); as a line before your return statement.
From here:
This method is implemented by using deferred execution. The immediate
return value is an object that stores all the information that is
required to perform the action. The query represented by this method
is not executed until the object is enumerated either by calling its
GetEnumerator method directly or by using foreach in Visual C# or For
Each in Visual Basic.
Seems it was quite obvious in the end. As I was using async and await, the call was returning immediately, before the data had arrived. Therefore there was nothing for the Take(4) to work on.
The only problem now is: how can I tell when the task has completed?
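One common way to know when the load has finished is to return a Task from the loading method instead of using async void, and to await that Task before reading the items. The sketch below assumes this approach; TeamSource, LoadAsync and GetTopAsync are hypothetical names, and a list of integers stands in for the TeamDataItem objects and the web API call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for TeamDataSource: the load is exposed as a Task
// instead of being started fire-and-forget from an async void method.
public class TeamSource
{
    private readonly List<int> _items = new List<int>();
    private readonly Task _loading;

    public TeamSource(Func<Task<List<int>>> download)
    {
        // Start loading, but keep the Task so callers can await completion.
        _loading = LoadAsync(download);
    }

    private async Task LoadAsync(Func<Task<List<int>>> download)
    {
        List<int> list = await download();
        _items.AddRange(list.OrderByDescending(p => p));
    }

    // Awaiting _loading guarantees the items exist before Take runs.
    public async Task<List<int>> GetTopAsync(int count)
    {
        await _loading;
        return _items.Take(count).ToList();
    }
}
```

A caller would then write something like var top = await source.GetTopAsync(10); from its own async method. The original Get() keeps working without Take only because the view binds to an ObservableCollection, which notifies the UI when items are added later.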

How can I combine two streams ordered then grouped by timestamp?

I have two streams of objects that each have a Timestamp value. Both streams are in order, so for example the timestamps might be Ta = 1,3,6,6,7 in one stream and Tb = 1,2,5,5,6,8 in the other. Objects in both streams are of the same type.
What I'd like to be able to do is to put each of these events on the bus in order of timestamp, i.e., put A1, then B1, B2, A3 and so on. Furthermore, since some streams have several (sequential) elements with the same timestamp, I want those elements grouped so that each new event is an array. So we would put [A3] on the bus, followed by [A15,A25] and so on.
I've tried to implement this by making two ConcurrentQueue structures, putting each event at the back of the queue, then looking at each front of the queue, choosing first the earlier event and then traversing the queue such that all events with this timestamp are present.
However, I've encountered two problems:
If I leave these queues unbounded, I quickly run out of memory as the read op is a lot faster than the handlers receiving the events. (I've got a few gigabytes of data).
I sometimes end up with a situation where I handle the event, say, A15 before A25 has arrived. I somehow need to guard against this.
I'm thinking that Rx can help in this regard but I don't see an obvious combinator(s) to make this possible. Thus, any advice is much appreciated.
Rx is indeed a good fit for this problem IMO.
IObservables can't 'OrderBy' for obvious reasons (you would have to observe the entire stream first to guarantee the correct output order), so my answer below makes the assumption (that you stated) that your 2 source event streams are in order.
It was an interesting problem in the end. The standard Rx operators are missing a GroupByUntilChanged that would have solved this easily, as long as it called OnCompleted on the previous group observable when the first element of the next group was observed. However, looking at the implementation of DistinctUntilChanged, it doesn't follow this pattern and only calls OnCompleted when the source observable completes (even though it knows there will be no more elements after the first non-distinct element... weird?). Anyway, for those reasons, I decided against a GroupByUntilChanged method (to not break Rx conventions) and went instead for a ToEnumerableUntilChanged.
Disclaimer: This is my first Rx extension so would appreciate feedback on my choices made. Also, one main concern of mine is the anonymous observable holding the distinctElements list.
Firstly, your application code is quite simple:
public class Event
{
public DateTime Timestamp { get; set; }
}
private IObservable<Event> eventStream1;
private IObservable<Event> eventStream2;
public IObservable<IEnumerable<Event>> CombineAndGroup()
{
return eventStream1.CombineLatest(eventStream2, (e1, e2) => e1.Timestamp < e2.Timestamp ? e1 : e2)
.ToEnumerableUntilChanged(e => e.Timestamp);
}
Now for the ToEnumerableUntilChanged implementation (wall of code warning):
public static IObservable<IEnumerable<TSource>> ToEnumerableUntilChanged<TSource,TKey>(this IObservable<TSource> source, Func<TSource,TKey> keySelector)
{
// TODO: Follow Rx conventions and create a superset overload that takes the IEqualityComparer as a parameter
var comparer = EqualityComparer<TKey>.Default;
return Observable.Create<IEnumerable<TSource>>(observer =>
{
var currentKey = default(TKey);
var hasCurrentKey = false;
var distinctElements = new List<TSource>();
return source.Subscribe((value =>
{
TKey elementKey;
try
{
elementKey = keySelector(value);
}
catch (Exception ex)
{
observer.OnError(ex);
return;
}
if (!hasCurrentKey)
{
hasCurrentKey = true;
currentKey = elementKey;
distinctElements.Add(value);
return;
}
bool keysMatch;
try
{
keysMatch = comparer.Equals(currentKey, elementKey);
}
catch (Exception ex)
{
observer.OnError(ex);
return;
}
if (keysMatch)
{
distinctElements.Add(value);
return;
}
// Emit the finished group, then start a fresh list so the emitted one is never mutated
observer.OnNext(distinctElements);
distinctElements = new List<TSource> { value };
currentKey = elementKey;
}), observer.OnError, () =>
{
if (distinctElements.Count > 0)
observer.OnNext(distinctElements);
observer.OnCompleted();
});
});
}
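As a pull-based alternative to the Rx code above, the merge-then-group step can also be sketched over two IEnumerable sequences that are each already sorted by timestamp. Because items are only pulled when the consumer asks for the next group, just the current group is held in memory, and a group is only emitted once both inputs have moved past its timestamp. OrderedMerge and MergeAndGroup are hypothetical names; this is not an Rx or LINQ operator.

```csharp
using System;
using System.Collections.Generic;

public static class OrderedMerge
{
    // Merge two sequences that are each already sorted by key, emitting one
    // list per distinct key. Items are pulled on demand, so only the current
    // group is buffered. Hypothetical helper, not part of Rx or LINQ.
    public static IEnumerable<List<T>> MergeAndGroup<T, TKey>(
        IEnumerable<T> first, IEnumerable<T> second, Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        using (var a = first.GetEnumerator())
        using (var b = second.GetEnumerator())
        {
            bool hasA = a.MoveNext(), hasB = b.MoveNext();
            while (hasA || hasB)
            {
                // The next group key is the smaller of the two heads.
                TKey key;
                if (!hasB) key = keySelector(a.Current);
                else if (!hasA) key = keySelector(b.Current);
                else
                {
                    TKey ka = keySelector(a.Current), kb = keySelector(b.Current);
                    key = ka.CompareTo(kb) <= 0 ? ka : kb;
                }

                // Drain every item with that key from both inputs.
                var group = new List<T>();
                while (hasA && keySelector(a.Current).CompareTo(key) == 0)
                {
                    group.Add(a.Current);
                    hasA = a.MoveNext();
                }
                while (hasB && keySelector(b.Current).CompareTo(key) == 0)
                {
                    group.Add(b.Current);
                    hasB = b.MoveNext();
                }
                yield return group;
            }
        }
    }
}
```

With the timestamps from the question (1,3,6,6,7 and 1,2,5,5,6,8) this produces seven groups, including one group holding all three items stamped 6.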

Caching attribute for method?

Maybe this is dreaming, but is it possible to create an attribute that caches the output of a function (say, in HttpRuntime.Cache) and returns the value from the cache instead of actually executing the function when the parameters to the function are the same?
When I say function, I'm talking about any function, whether it fetches data from a DB, whether it adds two integers, or whether it spits out the content of a file. Any function.
Your best bet is PostSharp. I have no idea if they have what you need, but that's certainly worth checking. By the way, make sure to publish the answer here if you find one.
EDIT: also, googling "postsharp caching" gives some links, like this one: Caching with C#, AOP and PostSharp
UPDATE: I recently stumbled upon this article: Introducing Attribute Based Caching. It describes a postsharp-based library on http://cache.codeplex.com/ if you are still looking for a solution.
I have just the same problem: I have multiple expensive methods in my app and I need to cache their results. Some time ago I just copy-pasted similar code, but then I decided to factor this logic out of my domain.
This is how I did it before:
static List<News> _topNews = null;
static DateTime _topNewsLastUpdateTime = DateTime.MinValue;
const int CacheTime = 5; // In minutes
public IList<News> GetTopNews()
{
if (_topNewsLastUpdateTime.AddMinutes(CacheTime) < DateTime.Now)
{
_topNews = GetList(TopNewsCount);
_topNewsLastUpdateTime = DateTime.Now; // remember when the cache was refreshed
}
return _topNews;
}
And that is how I can write it now:
public IList<News> GetTopNews()
{
return Cacher.GetFromCache(() => GetList(TopNewsCount));
}
Cacher - is a simple helper class, here it is:
public static class Cacher
{
const int CacheTime = 5; // In minutes
static Dictionary<long, CacheItem> _cachedResults = new Dictionary<long, CacheItem>();
public static T GetFromCache<T>(Func<T> action)
{
long code = action.GetHashCode();
if (!_cachedResults.ContainsKey(code))
{
lock (_cachedResults)
{
if (!_cachedResults.ContainsKey(code))
{
_cachedResults.Add(code, new CacheItem { LastUpdateTime = DateTime.MinValue });
}
}
}
CacheItem item = _cachedResults[code];
if (item.LastUpdateTime.AddMinutes(CacheTime) >= DateTime.Now)
{
return (T)item.Result;
}
T result = action();
_cachedResults[code] = new CacheItem
{
LastUpdateTime = DateTime.Now,
Result = result
};
return result;
}
}
class CacheItem
{
public DateTime LastUpdateTime { get; set; }
public object Result { get; set; }
}
A few words about Cacher. You might notice that I don't use Monitor.Enter() (lock(...)) while computing results. That's because copying the CacheItem reference (the return (T)item.Result; line) is a thread-safe operation: reading or writing a reference is atomic. It is also OK if more than one thread replaces this reference at the same time, since every value written is valid.
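Two caveats apply to the helper above: Dictionary<TKey, TValue> is not safe for concurrent writers, and a delegate's hash code is not guaranteed to be unique per call site. A variant that sidesteps both, sketched here under the assumption that the caller can supply an explicit string key, could use ConcurrentDictionary. KeyedCacher and GetOrCompute are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;

public static class KeyedCacher
{
    private sealed class Entry
    {
        public DateTime Created;
        public object Value;
    }

    private static readonly ConcurrentDictionary<string, Entry> Cache =
        new ConcurrentDictionary<string, Entry>();

    // Return the cached value for key if it is younger than maxAge;
    // otherwise run compute, store the result and return it.
    // Two threads may still compute the same key concurrently; last write wins.
    public static T GetOrCompute<T>(string key, TimeSpan maxAge, Func<T> compute)
    {
        Entry entry;
        if (Cache.TryGetValue(key, out entry) &&
            DateTime.UtcNow - entry.Created < maxAge)
        {
            return (T)entry.Value;
        }

        T result = compute();
        Cache[key] = new Entry { Created = DateTime.UtcNow, Value = result };
        return result;
    }
}
```

The key would typically combine the method name with its argument values, for example "GetTopNews:10", so that different arguments get different cache entries.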
You could add a dictionary to your class, using a comma-separated string that includes the function name as the key and the result as the value. Then your functions can check the dictionary for the existence of that value. Save the dictionary in the cache so that it exists for all users.
PostSharp is your one stop shop for this if you want to create a [Cache] attribute (or similar) that you can stick on any method anywhere. Previously when I used PostSharp I could never get past how slow it made my builds (this was back in 2007ish, so this might not be relevant anymore).
An alternate solution is to look into using Render.Partial with ASP.NET MVC in combination with OutputCaching. This is a great solution for serving html for widgets / page regions.
Another solution that would be with MVC would be to implement your [Cache] attribute as an ActionFilterAttribute. This would allow you to take a controller method and tag it to be cached. It would only work for controller methods since the AOP magic only can occur with the ActionFilterAttributes during the MVC pipeline.
Implementing AOP through ActionFilterAttribute has evolved to be the go-to solution for my shop.
AFAIK, frankly, no.
But this would be quite an undertaking to implement within the framework in order for it to work generically for everybody in all circumstances, anyway - you could, however, tailor something quite sufficient to needs by simply (where simplicity is relative to needs, obviously) using abstraction, inheritance and the existing ASP.NET Cache.
If you don't need attribute configuration but accept code configuration, maybe MbCache is what you're looking for?
