I am using the Rx Distinct operator to filter an external data stream by a certain key within a long-running process.
Will this cause a memory leak, assuming a lot of different keys will be received? How does the Rx Distinct operator keep track of previously received keys?
Should I use GroupByUntil with a duration selector instead?
Observable.Distinct uses a HashSet internally. Memory usage will be roughly proportional to the number of distinct keys encountered (AFAIK about 30*n bytes).
GroupByUntil does something quite different from Distinct.
GroupByUntil, as the name suggests, groups, whereas Distinct filters the elements of a stream.
Not sure about the intended use, but if you just want to filter out consecutive identical elements, you need Observable.DistinctUntilChanged, which has a memory footprint independent of the number of keys.
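A minimal sketch of the difference, assuming System.Reactive is referenced (the id stream is purely illustrative):
using System;
using System.Reactive.Linq;

// Distinct remembers every key it has ever seen (a HashSet internally),
// while DistinctUntilChanged only compares against the previous element,
// so its memory use does not grow with the number of keys.
var ids = new[] { 1, 1, 2, 1, 3, 3 }.ToObservable();

ids.Distinct().Subscribe(Console.WriteLine);              // 1, 2, 3
ids.DistinctUntilChanged().Subscribe(Console.WriteLine);  // 1, 2, 1, 3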
This may be a controversial tactic, but if you were worried about distinct keys accumulating, and if there was a point in time where this could safely be reset, you could introduce a reset policy using Observable.Switch. For example, we have a scenario where the "state of the world" is reset on a daily basis, so we could reset the distinct observable daily.
Observable.Create<MyPoco>(
    observer =>
    {
        var distinctPocos = new BehaviorSubject<IObservable<MyPoco>>(pocos.Distinct(x => x.Id));

        var timerSubscription =
            Observable.Timer(
                new DateTimeOffset(DateTime.UtcNow.Date.AddDays(1)),
                TimeSpan.FromDays(1),
                schedulerService.Default)
                .Subscribe(
                    t =>
                    {
                        Log.Info("Daily reset - resetting distinct subscription.");
                        distinctPocos.OnNext(pocos.Distinct(x => x.Id));
                    });

        var pocoSubscription = distinctPocos.Switch().Subscribe(observer);

        return new CompositeDisposable(timerSubscription, pocoSubscription);
    });
However, I do tend to agree with James World's comment above about testing with a memory profiler to check that memory is indeed an issue before introducing potentially unnecessary complexity. If you're accumulating 32-bit ints as the key, you'd need many millions of unique items before running into memory issues on most platforms; 262,144 32-bit int keys take up one megabyte. It may be that you reset the process long before that point, depending on your scenario.
Related
We have an application in which we have a materialized array of items that we are going to process through a Reactive pipeline. It looks a little like this:
EventLoopScheduler eventLoop = new EventLoopScheduler();
IScheduler concurrency = new TaskPoolScheduler(
    new TaskFactory(
        new LimitedConcurrencyLevelTaskScheduler(threadCount)));

IEnumerable<int> numbers = Enumerable.Range(1, itemCount);

// 1. transform on single thread
IConnectableObservable<byte[]> source =
    numbers.Select(Transform).ToObservable(eventLoop).Publish();

// 2. naive parallelization, restricts parallelization to Work
// only; chunk up sequence into smaller sequences and process
// in parallel, merging results
IObservable<int> final = source.
    Buffer(10).
    Select(
        batch =>
            batch.
            ToObservable(concurrency).
            Buffer(10).
            Select(
                concurrentBatch =>
                    concurrentBatch.
                    Select(Work).
                    ToArray().
                    ToObservable(eventLoop)).
            Merge()).
    Merge();

final.Subscribe();
source.Connect();
Await(final).Wait();
If you are really curious to play with this, the stand-in methods look like this:
private async static Task Await(IObservable<int> final)
{
    await final.LastOrDefaultAsync();
}

private static byte[] Transform(int number)
{
    if (number == itemCount)
    {
        Console.WriteLine("numbers exhausted.");
    }

    byte[] buffer = new byte[1000000];
    Buffer.BlockCopy(bloat, 0, buffer, 0, bloat.Length);
    return buffer;
}

private static int Work(byte[] buffer)
{
    Console.WriteLine("t {0}.", Thread.CurrentThread.ManagedThreadId);
    Thread.Sleep(50);
    return 1;
}
A little explanation. Range(1, itemCount) simulates raw inputs, materialized from a data-source. Transform simulates an enrichment process each input must go through, and results in a larger memory footprint. Work is a "lengthy" process which operates on the transformed input.
Ideally, we want to minimize the number of transformed inputs held concurrently by the system, while maximizing throughput by parallelizing Work. The number of transformed inputs in memory should be batch size (10 above) times concurrent work threads (threadCount).
So for 5 threads, we should retain 50 Transform items at any given time; and if, as here, the transform is a 1MB byte buffer, then we would expect memory consumption to be at about 50MB throughout the run.
What I find is quite different: Rx eagerly consumes all the numbers and Transforms them up front (as evidenced by the "numbers exhausted." message), resulting in a massive memory spike up front (~1 GB for an itemCount of 1000).
My basic question is: is there a way to achieve what I need (i.e. minimized consumption, throttled by multi-threaded batching)?
UPDATE: Sorry for the reversal, James; at first I did not think paulpdaniels' and Enigmativity's composition of Work(Transform) applied (this has to do with the nature of our actual implementation, which is more complex than the simple scenario provided above). However, after some further experimentation, I may be able to apply the same principle: i.e. defer Transform until the batch executes.
You have made a couple of mistakes in your code that throw off all of your conclusions.
First up, you've done this:
IEnumerable<int> numbers = Enumerable.Range(1, itemCount);
You've used Enumerable.Range, which means that when you call numbers.Select(Transform) you are going to burn through all of the numbers as fast as a single thread can go. Rx hasn't even had a chance to do any work, because up to this point your pipeline is entirely enumerable.
The next issue is in your subscriptions:
final.Subscribe();
source.Connect();
Await(final).Wait();
Because you call final.Subscribe() and Await(final).Wait() you are creating two separate subscriptions to the final observable.
Since there is a source.Connect() in the middle, the second subscription may miss out on values.
So, let's try to remove all of the cruft that's going on here and see if we can work things out.
If you go down to this:
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .Select(bs => Work(bs));
Things work well. The numbers get exhausted right at the end, and processing 20 items on my machine takes about 1 second.
But this is processing everything in sequence. And the Work step provides back-pressure on Transform to slow down the speed at which it consumes the numbers.
Let's add concurrency.
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .SelectMany(bs => Observable.Start(() => Work(bs)));
This processes 20 items in 0.284 seconds, and the numbers are exhausted after only 5 items have been processed. There is no longer any back-pressure on the numbers: the scheduler hands all of the work to Observable.Start, so it is ready for the next number immediately.
Let's reduce the concurrency.
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .SelectMany(bs => Observable.Start(() => Work(bs), concurrency));
Now the 20 items get processed in 0.5 seconds. Only two get processed before the numbers are exhausted. This makes sense as we've limited concurrency to two threads. But still there's no back pressure on the consumption of the numbers so they get chewed up pretty quickly.
Having said all of this, I tried to construct a query with the appropriate back-pressure, but I couldn't find a way. The crux is that Transform(...) performs far faster than Work(...), so it completes far more quickly.
So then the obvious move for me was this:
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .SelectMany(n => Observable.Start(() => Work(Transform(n)), concurrency));
This doesn't exhaust the numbers until the end, and it limits processing to two threads. It appears to do the right thing for what you want, except that I've had to compose Work(Transform(...)) together.
The very fact that you want to limit the amount of work you are doing suggests you should be pulling data, not having it pushed at you. I would forget about Rx in this scenario: fundamentally, what you have described is not a reactive application. Rx is also best suited to processing items serially, since it works with sequential event streams.
Why not just keep your data source enumerable and use PLINQ, Parallel.ForEach or TPL Dataflow? All of those sound better suited to your problem.
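For illustration only, a hedged sketch of what that could look like, using the names from the question (numbers, Transform, Work, threadCount). This is a pull-based pipeline with a bounded degree of parallelism, not the author's code:

// Requires System.Linq and System.Threading.Tasks.

// PLINQ: pull items on demand and collect the results of Work
// with a bounded degree of parallelism.
int[] results = numbers
    .AsParallel()
    .WithDegreeOfParallelism(threadCount)
    .Select(n => Work(Transform(n)))
    .ToArray();

// Parallel.ForEach: the same idea when no result collection is needed.
Parallel.ForEach(
    numbers,
    new ParallelOptions { MaxDegreeOfParallelism = threadCount },
    n => Work(Transform(n)));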
As @JamesWorld said, it may very well be that you want to use PLINQ to perform this task; it really depends on whether you are actually reacting to data in your real scenario or just iterating through it.
If you choose to go the Reactive route you can use Merge to control the level of parallelization occurring:
var source = numbers
    .Select(n =>
        Observable.Defer(() => Observable.Start(() => Work(Transform(n)), concurrency)))
    // Maximum concurrency
    .Merge(10)
    // Schedule all the output back onto the event loop scheduler
    .ObserveOn(eventLoop);
The above code will consume all the numbers first (sorry, no way to avoid that); however, by wrapping the processing in a Defer and following it up with a Merge that limits parallelization, only x items can be in flight at a time. Start() takes a scheduler as its second argument, which it uses to execute the provided method. Finally, since you are basically just pushing the values of Transform into Work, I composed them within the Start call.
As a side note, you can await an Observable and it will be equivalent to the code you have, i.e.:
await source; //== await source.LastAsync();
I'm trying to implement data caching for a web app in ASP.NET. This is for a class assignment, and I've been asked to limit the number of entries in the ObjectCache, not by memory size but by the number of entries itself. This is quite easy since I can call ObjectCache.Count, but when the cache grows beyond the established limit (5, just for testing) I can't figure out how to remove the oldest element stored, since the entries are sorted alphabetically by key.
This is being implemented in a service at the data access layer, so I can't use any additional structure like a Queue to keep track of the insertions into the cache.
What can I do? Is there a way to filter or get the oldest element in the cache?
Here's the method code:
public List<EventSummary> FindEvents(String keywords, long categoryId, int start, int count)
{
    string queryKey = "FindEvent-" + start + ":" + count + "-" + keywords.Trim() + "-" + categoryId;
    ObjectCache cache = MemoryCache.Default;
    List<EventSummary> val = (List<EventSummary>)cache.Get(queryKey);

    if (val != null)
        return val;

    Category evnCategory = CategoryDao.Find(categoryId);
    List<Event> fullResult = EventDao.FindByEventCategoryAndKeyword(evnCategory, keywords, start, count);
    List<EventSummary> summaryResult = new List<EventSummary>();

    foreach (Event evento in fullResult)
    {
        summaryResult.Add(new EventSummary(evento.evnId, evento.evnName, evento.Category, evento.evnDate));
    }

    if (cache.Count() >= maxCacheSize)
    {
        //WHAT SHOULD I DO HERE?
    }

    cache.Add(queryKey, summaryResult, DateTime.Now.AddDays(cacheDays));
    return summaryResult;
}
As mentioned in the comments, the Trim method on MemoryCache uses an LRU (Least Recently Used) policy, which is the behavior you are looking for here. Unfortunately, the method is not based on an absolute number of objects to remove from the cache, but rather on a percentage, which is an int parameter. This means that if you try to hack your way around it and pass 1 / cache.Count() as the percentage, you have no control over how many objects have truly been removed from the cache, which is not an ideal scenario.
Another way to do it would be to go with a DIY approach and simply not use the .NET caching utilities, since in your case they do not seem to exactly fit your needs. I'm thinking of something along the lines of a SortedDictionary with the timecode of your cache objects as the key and a list of the cache objects inserted at that timecode as the values. It would be a good and, IMO, not too daring exercise to try to reproduce the .NET cache behavior you are already using, with the additional benefit of directly controlling the removal policy yourself.
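As a rough, hypothetical sketch of that idea (names and details are illustrative, and per-entry expiration is omitted), it could look something like this:

using System;
using System.Collections.Generic;
using System.Linq;

// Size-bounded cache that evicts the oldest insertion when full.
// Illustration only, not a drop-in replacement for MemoryCache.
public class SimpleBoundedCache<TValue>
{
    private readonly int maxEntries;
    private readonly Dictionary<string, TValue> items = new Dictionary<string, TValue>();
    private readonly SortedDictionary<DateTime, List<string>> byInsertionTime =
        new SortedDictionary<DateTime, List<string>>();

    public SimpleBoundedCache(int maxEntries)
    {
        this.maxEntries = maxEntries;
    }

    public bool TryGet(string key, out TValue value)
    {
        return items.TryGetValue(key, out value);
    }

    public void Add(string key, TValue value)
    {
        if (items.ContainsKey(key))
            return;

        if (items.Count >= maxEntries)
            EvictOldest();

        items[key] = value;

        DateTime now = DateTime.UtcNow;
        List<string> keysAtTime;
        if (!byInsertionTime.TryGetValue(now, out keysAtTime))
        {
            keysAtTime = new List<string>();
            byInsertionTime[now] = keysAtTime;
        }
        keysAtTime.Add(key);
    }

    private void EvictOldest()
    {
        // SortedDictionary enumerates in ascending key order,
        // so the first entry holds the oldest insertion time.
        KeyValuePair<DateTime, List<string>> oldest = byInsertionTime.First();
        string oldestKey = oldest.Value[0];
        oldest.Value.RemoveAt(0);
        if (oldest.Value.Count == 0)
            byInsertionTime.Remove(oldest.Key);
        items.Remove(oldestKey);
    }
}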
As a side comment, not directly related to your question: the biggest problem with caches in managed-memory runtimes is the GC. The moment you start storing more than a few million entries you are asking for eventual GC pauses, even with the most advanced non-blocking collectors. It is hard to cache over 16 GB without pausing every now and then for 5-6 seconds (and that is a stop-the-world pause).
I have previously described here: https://stackoverflow.com/a/30584575/1932601 why caching objects as-is is eventually a bad choice if you need to store very many expiring entries (say, 100 million chat messages).
Take a look at what we did to store hundreds of millions of objects for a long time without killing the GC: https://www.youtube.com/watch?v=Dz_7hukyejQ
I have an interesting problem that could be solved in a number of ways:
I have a function that takes in a string.
If this function has never seen this string before, it needs to perform some processing.
If the function has seen the string before, it needs to skip processing.
After a specified amount of time, the function should accept duplicate strings.
This function may be called thousands of time per second, and the string data may be very large.
This is a highly abstracted explanation of the real application, just trying to get down to the core concept for the purpose of the question.
The function will need to store state in order to detect duplicates. It also will need to store an associated timestamp in order to expire duplicates.
It does NOT need to store the strings; a unique hash of the string would be fine, provided there are no false positives due to collisions (use a perfect hash?) and the hash function is performant enough.
The naive implementation would be simply (in C#):
Dictionary<String,DateTime>
though in the interest of lowering the memory footprint and potentially increasing performance, I'm evaluating custom data structures to handle this instead of a basic hashtable.
So, given these constraints, what would you use?
EDIT, some additional information that might change proposed implementations:
99% of the strings will not be duplicates.
Almost all of the duplicates will arrive back to back, or nearly sequentially.
In the real world, the function will be called from multiple worker threads, so state management will need to be synchronized.
I don't believe it is possible to construct a "perfect hash" without knowing the complete set of values first (especially in the case of a C# int, with its limited range of values). So any kind of hashing requires the ability to compare the original values too.
I think a dictionary is the best you can get with out-of-the-box data structures. Since you can store objects with custom comparisons defined, you can easily avoid keeping the strings in memory and simply save the location where the whole string can be obtained, i.e. an object with the following values:
stringLocation.fileName="file13.txt";
stringLocation.fromOffset=100;
stringLocation.toOffset=345;
expiration= "2012-09-09T1100";
hashCode = 123456;
where a custom comparer will return the saved hashCode, or retrieve the string from the file if needed and perform the comparison.
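A rough sketch of what such a key type and comparer might look like (all names and the file layout are hypothetical assumptions, not part of the original answer):

using System.Collections.Generic;
using System.IO;
using System.Text;

// The dictionary key stores only a precomputed hash code and the location of
// the original string on disk; the full string is read back only when two
// hash codes collide.
public class StringLocation
{
    public string FileName;
    public int FromOffset;
    public int ToOffset;
    public int HashCode;

    public string ReadString()
    {
        using (FileStream stream = File.OpenRead(FileName))
        {
            stream.Seek(FromOffset, SeekOrigin.Begin);
            byte[] buffer = new byte[ToOffset - FromOffset];
            stream.Read(buffer, 0, buffer.Length);
            return Encoding.UTF8.GetString(buffer);
        }
    }
}

public class StringLocationComparer : IEqualityComparer<StringLocation>
{
    public int GetHashCode(StringLocation x)
    {
        return x.HashCode;
    }

    public bool Equals(StringLocation a, StringLocation b)
    {
        // Cheap check first; only hit the disk when the hash codes match.
        if (a.HashCode != b.HashCode)
            return false;
        return a.ReadString() == b.ReadString();
    }
}

// Usage: new Dictionary<StringLocation, DateTime>(new StringLocationComparer());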
a unique hash of the string would be fine, providing there is no false
positives due to collisions
That's not possible, if you want the hash code to be shorter than the strings.
Using hash codes implies that there are false positives, only that they are rare enough not to be a performance problem.
I would even consider creating the hash code from only part of the string, to make it faster. Even if that means you get more false positives, it could increase the overall performance.
Provided the memory footprint is tolerable, I would suggest a HashSet<string> for the strings, and a queue to store Tuple<DateTime, String>. Something like:
HashSet<string> Strings = new HashSet<string>();
Queue<Tuple<DateTime, String>> Expirations = new Queue<Tuple<DateTime, String>>();
Now, when a string comes in:
if (Strings.Add(s))
{
    // string is new. process it.
    // and add it to the expiration queue
    Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
}
And, somewhere you'll have to check for the expirations. Perhaps every time you get a new string, you do this:
while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
{
    var e = Expirations.Dequeue();
    Strings.Remove(e.Item2);
}
It'd be hard to beat the performance of HashSet here. Granted, you're storing the strings, but that's going to be the only way to guarantee no false positives.
You might also consider using a time stamp other than DateTime.Now. What I typically do is start a Stopwatch when the program starts, and then use the ElapsedMilliseconds value. That avoids potential problems that occur during Daylight Saving Time changes, when the system automatically updates the clock (using NTP), or when the user changes the date/time.
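A minimal sketch of that approach (the names here are illustrative, and ExpireTime is the TimeSpan from the snippet above):

using System.Diagnostics;

// A monotonic timestamp source, immune to wall-clock adjustments.
static class MonotonicClock
{
    private static readonly Stopwatch Clock = Stopwatch.StartNew();

    public static long NowMs
    {
        get { return Clock.ElapsedMilliseconds; }
    }
}

// Store MonotonicClock.NowMs + (long)ExpireTime.TotalMilliseconds in the queue
// instead of DateTime.Now + ExpireTime, and compare stored values against
// MonotonicClock.NowMs when expiring.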
Whether the above solution works for you is going to depend on whether you can stand the memory hit of storing the strings.
Added after "Additional information" was posted:
If this will be accessed by multiple threads, I'd suggest using ConcurrentDictionary rather than HashSet, and BlockingCollection rather than Queue. Or, you could use lock to synchronize access to the non-concurrent data structures.
If it's true that 99% of the strings will not be duplicates, then you'll almost certainly need an expiration queue that can remove things from the dictionary.
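As a minimal sketch of the simplest locking option (Strings, Expirations and ExpireTime are the names from the snippets above; this is illustrative, not a tuned implementation):

static readonly object Gate = new object();

// Returns true if the string is new and should be processed.
static bool IsNew(string s)
{
    lock (Gate)
    {
        // Drop expired entries first, then attempt to add.
        while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
        {
            var e = Expirations.Dequeue();
            Strings.Remove(e.Item2);
        }

        if (!Strings.Add(s))
            return false;

        Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
        return true;
    }
}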
If the memory footprint of storing whole strings is not acceptable, you have only two choices:
1) Store only hashes of the strings, which implies the possibility of hash collisions (when the hash is shorter than the strings). A good hash function (MD5, SHA1, etc.) makes such collisions nearly impossible in practice, so it only depends on whether it is fast enough for your purpose.
2) Use some kind of lossless compression. Text usually compresses well (down to roughly 10% of the original size), and some algorithms such as ZIP let you choose between fast (and less efficient) and slow (high compression ratio) compression. Another way to shrink strings is to convert them to UTF-8, which is fast and easy to do and gives nearly 50% savings for strings that contain only ASCII characters.
Whichever way you choose, it's always a tradeoff between memory footprint and hashing/compression speed. You will probably need to do some benchmarking to choose the best solution.
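As a rough illustration of option 1 (assuming SHA-256 and a plain dictionary keyed by the Base64-encoded hash; the expiration handling here is simplified, and stale entries are only overwritten rather than proactively removed):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static readonly Dictionary<string, DateTime> SeenHashes = new Dictionary<string, DateTime>();

// Returns true if the string has not been seen within the expiry window.
static bool IsNewByHash(string s, TimeSpan expireAfter)
{
    string key;
    using (SHA256 sha = SHA256.Create())
    {
        key = Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(s)));
    }

    DateTime expiresAt;
    if (SeenHashes.TryGetValue(key, out expiresAt) && expiresAt > DateTime.UtcNow)
        return false; // seen recently

    SeenHashes[key] = DateTime.UtcNow + expireAfter;
    return true;
}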
I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large, you can use a bit vector of suitable size (sketched below).
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such a small array may be faster than the hashing required to use the hash set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
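For illustration, a minimal sketch of the bit-vector idea, assuming a known upper bound on the values (which uniformly distributed 64-bit values would not satisfy, so this only applies if the range can be bounded):

using System.Collections;

// One bit per possible value: Contains is O(1) and memory is
// (maxValue + 1) / 8 bytes, regardless of how many values are stored.
public class IntBitSet
{
    private readonly BitArray bits;

    public IntBitSet(int maxValue)
    {
        bits = new BitArray(maxValue + 1);
    }

    public void Add(int value)
    {
        bits[value] = true;
    }

    public bool Contains(int value)
    {
        return bits[value];
    }
}

// e.g. new IntBitSet(10000000) uses roughly 10,000,000 / 8 ≈ 1.2 MB.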
If the range of the values is bounded you could use a bitset. It stores one bit per possible value, so the total memory use is about n bits, with n being the greatest possible integer.
Another option is a Bloom filter. Bloom filters are very compact, but you have to be prepared for an occasional false positive in lookups. You can find more about them on Wikipedia.
A third option is using a simple sorted array. Lookups are O(log n), with n being the number of integers. It may be fast enough.
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
int FindIndex(long value)
{
    var index = (int)(value & 0x7FFFFFFF) % _storage.Length;
    var slotValue = _storage[index];
    if (slotValue == 0x0 || slotValue == value) return index;

    for (++index; ; index++)
    {
        if (index == _storage.Length) index = 0;
        slotValue = _storage[index];
        if (slotValue == 0x0 || slotValue == value) return index;
    }
}
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
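For completeness, a hedged sketch of what insertion and lookup might look like on top of FindIndex (the _count field is hypothetical, and growing the array to maintain the load factor is omitted):

// Sketch only: relies on FindIndex returning either the slot holding the
// value or the first empty (0) slot in the probe sequence.
bool Add(long value)
{
    int index = FindIndex(value);
    if (_storage[index] == value)
        return false;          // already present

    _storage[index] = value;   // claim the empty slot
    _count++;                  // hypothetical element counter
    return true;
}

bool Contains(long value)
{
    return _storage[FindIndex(value)] == value;
}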
I'm sure there's still room for optimization, and I may end up using some sort of BigArray<T> or sharding for the backing store on large sets. But the initial results are promising. It performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast at a load factor of 0.8, and even at 0.9 it's still about 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers with a sufficient size to handle however many integers you need. Is there any reason to stay away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
int capacity = ... some appropriate number;
int[] items = new int[capacity];
HashSet<int> hashSet = new HashSet<int>(items);
hashSet.Clear();
...
Looking at the implementation with Reflector, this will initialize the capacity to the size of the items array, ignoring the fact that the array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.
I have a dilemma. I have to implement a prioritized queue (custom sort order), and I need to insert/process/delete a lot of messages per second with it (~100-1000).
Which design is faster at run-time?
1) a custom collection (list) kept sorted by priority
2) a non-sorted list (collection) + a LINQ query run each time I need to process (dequeue) a message
3) something else
ADDED:
SOLUTION:
A list (dictionary) of queues keyed by priority: SortedList<int, VPair<bool, Queue<MyMessage>>>
where int is the priority and bool is true if the queue is not empty.
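A rough sketch of that shape, using a plain Queue per priority in a SortedDictionary (names are illustrative, and the VPair/non-empty flag is omitted; lower int means higher priority here):

using System.Collections.Generic;

public class PriorityMessageQueue<TMessage>
{
    private readonly SortedDictionary<int, Queue<TMessage>> queues =
        new SortedDictionary<int, Queue<TMessage>>();

    public void Enqueue(int priority, TMessage message)
    {
        Queue<TMessage> queue;
        if (!queues.TryGetValue(priority, out queue))
        {
            queue = new Queue<TMessage>();
            queues[priority] = queue;
        }
        queue.Enqueue(message);
    }

    public bool TryDequeue(out TMessage message)
    {
        // SortedDictionary enumerates priorities in ascending order,
        // so the first non-empty queue holds the highest-priority message.
        foreach (KeyValuePair<int, Queue<TMessage>> pair in queues)
        {
            if (pair.Value.Count > 0)
            {
                message = pair.Value.Dequeue();
                return true;
            }
        }
        message = default(TMessage);
        return false;
    }
}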
What's your read/write ratio? Are multiple threads involved, and if so, how?
As always when asking about performance, benchmark both code-paths and see for yourself (this is especially true the more specific your problem domain is).
The only way to know for sure is to measure the performance for yourself.
Well, finding an element in an unsorted data structure takes O(n) on average (one pass over the data structure). Binary search trees have an average insertion complexity of O(log n) and an average lookup complexity of O(log n). So in theory something like that would be faster. In practice the overhead or the shape of the data might kill the theoretical advantage.
Also if your custom sort order can change at runtime you might have to rebuild the sorted data structure which is an additional performance hit.
In the end: If it is important for your application then try the different approaches and benchmark it yourself - it's the only way to be certain that it works.
Introducing sorting will always incur an insertion performance overhead as far as I'm aware. If there's no need for the sorting then use a nice generic Dictionary which will provide a quick lookup based on your unique key.