I will say this right off the bat: I am an amateur at threading. I am a senior C# web developer, but I have a project that requires me to populate a lot of objects that take a long time to populate, as they require WebRequests and Responses. I have everything working without threading, but it does not run fast enough for my requirements. I would like to pass everything to a ThreadPool to have the threading managed for me, as I may be queuing up 20,000 threads at the same time, and for obvious reasons I do not want to hit a website with the requests needed to populate all of them at once.
What I would like to do is to pass in an object, populate it, and then add it to a collection in the main thread once it is populated. Then once all the objects are populated, continue on with execution of the program. I do not know how many objects will need to be populated until they are all populated either.
My question...What is the best approach to doing this?
Here is the loop that I am trying to speed up:
foreach (HElement hElement in repeatingTag.RunRepeatingTagInstruction())
{
    object newObject = Activator.CreateInstance(currentObject.GetType().GetGenericArguments()[0]);
    List<XElement> ordering = GetOrdering(tagInstructions.Attribute("type").Value);
    RunOrdering(ordering, newObject, hElement);
    MethodInfo method = currentObject.GetType().GetMethod("Add");
    method.Invoke(currentObject, new[] { newObject });
}
I don't know what the object is beforehand so I create it using the Activator. The RunOrdering method runs through the instructions that I pass that tell it how to populate the object. Then I add it to the collection. Also, the object itself may have properties that will require this method to run through and populate their data.
Since you probably have to wait for them all to complete, all you need is a Parallel.ForEach() or equivalent, plus a thread-safe collection. Note that for I/O-intensive tasks you will want to limit the number of threads: 20,000 threads would be insane in any situation.
But we would need to see more details (code). Note that there is no such thing as "a collection in the main thread".
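As a sketch of that suggestion (the URL list and the Thread.Sleep standing in for the WebRequest/Response round trip are hypothetical), limiting concurrency with ParallelOptions and collecting results into a thread-safe ConcurrentBag might look like this:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottledFetch
{
    static void Main()
    {
        // Hypothetical work items; in the real program these would be the
        // objects whose population requires a web round trip.
        var urls = Enumerable.Range(0, 50).Select(i => $"https://example.com/item/{i}");
        var results = new ConcurrentBag<string>();

        // Cap concurrency so the remote site is not flooded with requests.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

        Parallel.ForEach(urls, options, url =>
        {
            Thread.Sleep(10); // stand-in for the slow WebRequest/Response
            results.Add(url.ToUpperInvariant());
        });

        Console.WriteLine(results.Count);
    }
}
```

The bag plays the role of the "collection in the main thread": all workers add to it safely, and execution continues past the Parallel.ForEach only once every item is populated.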
populate a lot of objects that take a long time to populate as they require WebRequests and Responses
Avoid threading if you are doing requests. There is no speedup after two threads, and barely any even with two. A lot of trouble for nothing.
Couple of suggestions:
If you are on .NET 4, try using Tasks instead; you would have much better control over scheduling. Try not to share any objects: make them immutable, and follow all the usual warnings and best practices about synchronisation, shared data, etc.
And secondly, you might want to think of an out-of-process solution like message queues (the xMQ products, or a poor man's database table as a queue) so you have the chance to distribute your tasks over multiple machines if you need to.
Related
So the question is long but pretty self-explanatory. I have an app that runs on multiple servers and uses parallel looping to handle objects coming out of a MongoDB collection. Since MongoDB allows concurrent reads, I cannot stop multiple processes and/or servers from grabbing the same document from the collection and duplicating work.
The program is such that the app waits for information to appear, does some work to figure out what to do with it, then deletes it once it's done. What I hope to achieve is that, by keeping documents from being accessed at the same time (knowing that once one has been read it will eventually be deleted), I can speed up my overall throughput a bit by reducing the number of duplicates and letting the apps grab things that aren't being worked on.
I don't think pessimistic locking is quite what I'm looking for, but maybe I misunderstood the concept. Also, if alternative setups are being used to solve the same problem, I would love to hear what might be in use.
Thanks!
What I hope to achieve is that if I could keep documents from being accessed at the same time
The simplest way to achieve this is by introducing a dispatcher process into the architecture. Add a dedicated process that just watches for changes and then delegates or dispatches the tasks out to multiple workers.
The process could use MongoDB Change Streams to access real-time data changes on a single collection, a database, or an entire deployment. Once it receives a change document, it just sends it to a worker for processing.
This should also reduce the number of workers trying to access the same tasks and needing logic to back off.
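A minimal sketch of such a dispatcher using the MongoDB .NET driver's Watch() API; the connection string, database and collection names, and DispatchToWorker are all placeholders, and this needs a replica set or sharded deployment to run:

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical deployment: a "tasks" collection in a "work" database.
var client = new MongoClient("mongodb://localhost:27017");
var tasks = client.GetDatabase("work").GetCollection<BsonDocument>("tasks");

// The dispatcher is the only process watching the collection, so each
// inserted document is handed to exactly one worker.
using var cursor = tasks.Watch();
foreach (var change in cursor.ToEnumerable())
{
    if (change.OperationType == ChangeStreamOperationType.Insert)
    {
        DispatchToWorker(change.FullDocument);
    }
}

// Placeholder: in a real system this would enqueue the document for one
// of the worker processes (e.g. via a message queue).
static void DispatchToWorker(BsonDocument doc) =>
    Console.WriteLine($"dispatching {doc["_id"]}");
```

Because only the dispatcher reads the change stream, the workers no longer race each other for the same documents.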
I want to compute an easy parallelizable calculation (e.g. Mandelbrot) with Orleans on different grains parallel and merge the result back together once the grains are done. However, I am not sure how to do this or if Orleans is even the right framework for this kind of problem.
Also let me mention that this won't be any project which will go in production, I am just playing around with Orleans.
Here is my idea so far:
I have one graintype (let's call it "maingrain") which is an entry point for the client (might also be a grain). This grain then estimates the amount of needed processing power and divides the task into smaller parts which are distributed to other grains from another graintype (I will call these "subgrains"). It's no big deal to let these subgrains do the work and wait for a result which can be returned to the client, however I am not sure how to handle the subgrains.
Let's say there is a call where I want to use 10 subgrains. I get each one by a new GUID and let them work. They finish, and the client gets the result.
Now there is a call where I want to use X subgrains:
Should I simply activate X new subgrains with X new GUIDs and let the garbage collector do the cleanup?
Should I somehow reuse the previously activated subgrains (some kind of pooling) and how do I know that a subgrain is already reusable (=not busy)?
What happens if I want to use multiple maingrains? Does each handle its own subgrains?
How would you do it? Thank you.
You can mark the subgrain as a stateless worker using the Orleans.Concurrency.StatelessWorkerAttribute. The runtime will then automatically scale the grain out (create multiple activations of the same grain) when there is a backlog of messages in its queue, allowing these subtasks to be processed in parallel.
Found this quite interesting regarding stateless workers: http://encloudify.blogspot.co.uk/2014/05/grains-grains-and-more-grains.html
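For illustration, a stateless-worker grain might be declared like this; the interface name and the tile-computation details are made up for the Mandelbrot example, and this only runs inside a configured Orleans silo:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

// Hypothetical subgrain contract for one tile of the fractal.
public interface IMandelbrotWorker : IGrainWithIntegerKey
{
    Task<int[]> ComputeTile(int x, int y, int size);
}

// [StatelessWorker] lets the runtime create extra activations of this
// grain when its message queue backs up, so the maingrain can fan out
// tiles without managing GUIDs or pooling subgrains itself.
[StatelessWorker]
public class MandelbrotWorker : Grain, IMandelbrotWorker
{
    public Task<int[]> ComputeTile(int x, int y, int size)
    {
        var tile = new int[size * size];
        // ... iterate the Mandelbrot formula per pixel (omitted) ...
        return Task.FromResult(tile);
    }
}
```

The maingrain would then call the same grain reference for every tile and await all the returned Tasks, merging the tiles once they complete.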
I have a program of mine which makes use of the C# ConcurrentQueue to pass data from one of my components to the other.
Component 1:
Multiple network connections receive data and then put it into this Queue
Component 2:
Reads data from this queue and then processes it.
OK, good, it all makes sense (I sure hope).
Now what I want to know, is what is the best / most efficient way to go about passing the data between the two components?
Option 1:
Poll the queue for new data in component 2? This will entail blocking code, or at least a while(true) loop.
Option 2:
I don't know if this is possible, but that's why I'm here asking. Does the queue data structure not have some functionality whereby my component 2 can register with the queue to be notified of any inserts/changes? That way, whenever data is added it can just go fetch it, and I can avoid any blocking/polling code.
Component 1 (the producer) requires either manual or automatic locking, since you anticipate multiple concurrent writers (multiple network connections posting) while producing. This means a blocking queue makes sense in component 1. However, in component 2 (the consumer), if you only ever have one consumer at any time, then you don't need any blocking code.
In order to avoid the while loop, you need a mechanism to inform the consumer that someone has added something to the queue. This can be achieved with custom eventing (I am not talking about the EventWaitHandle subtypes). Keep in mind that you may not preserve element order with that style of eventing.
For a simple producer/consumer implementation you can try BlockingCollection. For more complex consumption of data from various sources, Reactive Extensions might help. It is a much steeper learning curve, but it is a very powerful push-based framework, so you don't need to do any polling.
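A self-contained sketch of option 2 via BlockingCollection: GetConsumingEnumerable() blocks while the queue is empty and completes once CompleteAdding() is called, so the consumer needs neither polling nor a hand-rolled while(true) loop. The two producer tasks stand in for the network connections:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerDemo
{
    static void Main()
    {
        // Bounded so runaway producers cannot exhaust memory.
        using var queue = new BlockingCollection<int>(boundedCapacity: 100);

        // Component 1: several connections can Add concurrently.
        var producers = Task.WhenAll(
            Task.Run(() => { for (int i = 0; i < 50; i++) queue.Add(i); }),
            Task.Run(() => { for (int i = 50; i < 100; i++) queue.Add(i); }));

        // Component 2: the foreach sleeps while the queue is empty and
        // wakes as items arrive; no busy-waiting involved.
        var consumer = Task.Run(() =>
        {
            long sum = 0;
            foreach (int item in queue.GetConsumingEnumerable())
                sum += item;
            return sum;
        });

        producers.Wait();
        queue.CompleteAdding(); // signals the consumer that no more data is coming
        Console.WriteLine(consumer.Result);
    }
}
```

Internally BlockingCollection wraps a ConcurrentQueue by default, so this is essentially the existing design plus the notification mechanism asked about.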
So, I've got a WCF application that accepts requests to do work at a specific time. I could have a list of thousands of things to do in the future at varying times. Is there an existing framework that we can leverage to do this? The current implementation polls a database, looking for things to do based on a datetime, which smells.
A few ideas.
Timers. Set a timer when the request comes in that fires at the appropriate time. This seems like it could leave too many threads floating around.
Maintain a list of objects with a datetime in memory, poll this for things to do.
Use a library like Quartz. I have concerns as to whether it can handle the volume.
If you keep a list of tasks sorted by their trigger times (your database should be able to do this without any issues; if you want to keep it in memory, Power Collections has a priority queue you could use), you can get by with a single timer that always fires for the first task in the list.
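A rough in-memory sketch of that single-timer idea; all names are illustrative, and a real implementation would need more care around error handling and timer drift:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// One timer, always armed for the earliest entry in a sorted structure.
class SingleTimerScheduler : IDisposable
{
    private readonly SortedDictionary<DateTime, Action> _tasks = new SortedDictionary<DateTime, Action>();
    private readonly Timer _timer;
    private readonly object _gate = new object();

    public SingleTimerScheduler() => _timer = new Timer(_ => RunDue());

    public void Schedule(DateTime when, Action work)
    {
        lock (_gate)
        {
            _tasks[when] = work; // one task per instant, to keep the sketch short
            Rearm();
        }
    }

    // Point the single timer at the earliest trigger time.
    private void Rearm()
    {
        if (_tasks.Count == 0) return;
        var delay = _tasks.Keys.First() - DateTime.UtcNow;
        _timer.Change(delay < TimeSpan.Zero ? TimeSpan.Zero : delay, Timeout.InfiniteTimeSpan);
    }

    private void RunDue()
    {
        var due = new List<Action>();
        lock (_gate)
        {
            while (_tasks.Count > 0 && _tasks.Keys.First() <= DateTime.UtcNow)
            {
                var first = _tasks.First();
                due.Add(first.Value);
                _tasks.Remove(first.Key);
            }
            Rearm();
        }
        foreach (var work in due) work();
    }

    public void Dispose() => _timer.Dispose();
}

class Demo
{
    static void Main()
    {
        using var done = new ManualResetEventSlim();
        using var scheduler = new SingleTimerScheduler();
        scheduler.Schedule(DateTime.UtcNow.AddMilliseconds(50), () => Console.WriteLine("first"));
        scheduler.Schedule(DateTime.UtcNow.AddMilliseconds(150), () => { Console.WriteLine("second"); done.Set(); });
        done.Wait(5000);
    }
}
```

The same shape works against the database: query for the earliest due time, sleep one timer until then, and re-query after each firing, instead of polling on a fixed interval.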
Scenario:
Data is received and written to the database with timestamps. I need to process the raw data in the order it was received, based on the timestamp, and write it back to the database, to a different table, again maintaining the order based on the timestamp.
I came up with the following design: I created two queues, one for storing raw data from the database, another for storing processed data before it is written back to the DB. I have two threads: one reads from the database into the Initial queue, and the other takes processed data from the Result queue and writes it back to the DB. In between, I spawn multiple threads to process data from the Initial queue and write it to the Result queue.
I have experimented with SortedList (with manual locking) and BlockingCollection. I have used two approaches to process in parallel: Parallel.For/ForEach and Task.Factory.StartNew.
Each unit of data may take a variable amount of time to process, depending on several factors. One thread can still be processing the first data point while other threads are each done with three or four data points, messing up the timestamp order.
I found out about OrderingPartitioner recently and thought it would solve the problem, but following MSDN's example I can see that it does not sort the underlying collection either. Maybe I need to implement a custom partitioner to order my collection of complex data types? Or maybe there's a better way of approaching the problem?
Any suggestions and/or links to articles discussing a similar problem are highly appreciated.
Personally, I would at least try to start with using a BlockingCollection<T> for the input and a ConcurrentQueue<T> instance for the results.
I would use Parallel LINQ to process the results. In order to preserve order during your processing, you could use AsOrdered() on the PLINQ statement.
Have you considered PLINQ and AsOrdered()? It might be helpful for what you're trying to achieve.
http://msdn.microsoft.com/en-us/library/dd460719.aspx
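A small self-contained example of what AsOrdered() guarantees: the results come back in input order even though the Select runs on multiple threads and items may finish out of order:

```csharp
using System;
using System.Linq;

class PlinqOrderDemo
{
    static void Main()
    {
        var input = Enumerable.Range(1, 10);

        // AsOrdered() makes PLINQ buffer results so the output sequence
        // matches the input order, at some cost in throughput.
        var squares = input
            .AsParallel()
            .AsOrdered()
            .Select(n => n * n)
            .ToList();

        Console.WriteLine(string.Join(",", squares));
    }
}
```

Applied to the scenario above, the Initial queue's timestamp order is preserved through the parallel processing stage without manual locking or a custom partitioner.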
Maybe you've considered these things, but...
Why not just pass the timestamp to the database and then either let the database do the ordering, or fix the ordering in the database after all processing threads have returned? Do the SQL statements have to be executed sequentially?
PLINQ is great but I would try to avoid thread synchronization requirements and simply pass more ordering data to the database if you can.