I am educating myself on Parallel.Invoke, and parallel processing in general, for use in current project. I need a push in the right direction to understand how you can dynamically\intelligently allocate more parallel 'threads' as required.
As an example. Say you are parsing large log files. This involves reading from file, some sort of parsing of the returned lines and finally writing to a database.
So to me this is a typical problem that can benefit from parallel processing.
As a simple first pass the following code implements this.
Parallel.Invoke(
()=> readFileLinesToBuffer(),
()=> parseFileLinesFromBuffer(),
()=> updateResultsToDatabase()
);
Behind the scenes
readFileLinesToBuffer() reads each line and stores to a buffer.
parseFileLinesFromBuffer comes along and consumes lines from buffer and then let's say it put them on another buffer so that updateResultsToDatabase() can come along and consume this buffer.
So the code shown assumes that each of the three steps uses the same amount of time\resources but lets say the parseFileLinesFromBuffer() is a long running process so instead of running just one of these methods you want to run two in parallel.
How can you have the code intelligently decide to do this based on any bottlenecks it might perceive?
Conceptually I can see how some approach of monitoring the buffer sizes might work, spawning a new 'thread' to consume the buffer at an increased rate for example...but I figure this type of issue has been considered in putting together the TPL library.
Some sample code would be great but I really just need a clue as to what concepts I should investigate next. It looks like maybe the System.Threading.Tasks.TaskScheduler holds the key?
Have you tried the Reactive Extensions?
http://msdn.microsoft.com/en-us/data/gg577609.aspx
The Rx is a new tecnology from Microsoft, the focus as stated in the official site:
The Reactive Extensions (Rx)... ...is a library to compose
asynchronous and event-based programs using observable collections and
LINQ-style query operators.
You can download it as a Nuget package
https://nuget.org/packages/Rx-Main/1.0.11226
Since I am currently learning Rx I wanted to take this example and just write code for it, the code I ended up it is not actually executed in parallel, but it is completely asynchronous, and guarantees the source lines are executed in order.
Perhaps this is not the best implementation, but like I said I am learning Rx, (thread-safe should be a good improvement)
This is a DTO that I am using to return data from the background threads
class MyItem
{
public string Line { get; set; }
public int CurrentThread { get; set; }
}
These are the basic methods doing the real work, I am simulating the time with a simple Thread.Sleep and I am returning the thread used to execute each method Thread.CurrentThread.ManagedThreadId. Note the timer of the ProcessLine it is 4 sec, it's the most time-consuming operation
private IEnumerable<MyItem> ReadLinesFromFile(string fileName)
{
var source = from e in Enumerable.Range(1, 10)
let v = e.ToString()
select v;
foreach (var item in source)
{
Thread.Sleep(1000);
yield return new MyItem { CurrentThread = Thread.CurrentThread.ManagedThreadId, Line = item };
}
}
private MyItem UpdateResultToDatabase(string processedLine)
{
Thread.Sleep(700);
return new MyItem { Line = "s" + processedLine, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
private MyItem ProcessLine(string line)
{
Thread.Sleep(4000);
return new MyItem { Line = "p" + line, CurrentThread = Thread.CurrentThread.ManagedThreadId };
}
The following method I am using it just to update the UI
private void DisplayResults(MyItem myItem, Color color, string message)
{
this.listView1.Items.Add(
new ListViewItem(
new[]
{
message,
myItem.Line ,
myItem.CurrentThread.ToString(),
Thread.CurrentThread.ManagedThreadId.ToString()
}
)
{
ForeColor = color
}
);
}
And finally this is the method that calls the Rx API
private void PlayWithRx()
{
// we init the observavble with the lines read from the file
var source = this.ReadLinesFromFile("some file").ToObservable(Scheduler.TaskPool);
source.ObserveOn(this).Subscribe(x =>
{
// for each line read, we update the UI
this.DisplayResults(x, Color.Red, "Read");
// for each line read, we subscribe the line to the ProcessLine method
var process = Observable.Start(() => this.ProcessLine(x.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(c =>
{
// for each line processed, we update the UI
this.DisplayResults(c, Color.Blue, "Processed");
// for each line processed we subscribe to the final process the UpdateResultToDatabase method
// finally, we update the UI when the line processed has been saved to the database
var persist = Observable.Start(() => this.UpdateResultToDatabase(c.Line), Scheduler.TaskPool)
.ObserveOn(this).Subscribe(z => this.DisplayResults(z, Color.Black, "Saved"));
});
});
}
This process runs totally in the background, this is the output generated:
in an async/await world, you'd have something like:
public async Task ProcessFileAsync(string filename)
{
var lines = await ReadLinesFromFileAsync(filename);
var parsed = await ParseLinesAsync(lines);
await UpdateDatabaseAsync(parsed);
}
then a caller could just do var tasks = filenames.Select(ProcessFileAsync).ToArray(); and whatever (WaitAll, WhenAll, etc, depending on context)
Use a couple of BlockingCollection. Here is an example
The idea is that you create a producer that puts data into the collection
while (true) {
var data = ReadData();
blockingCollection1.Add(data);
}
Then you create any number of consumers that reads from the collection
while (true) {
var data = blockingCollection1.Take();
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
}
and so on
You can also let TPL handle the number of consumers by using Parallel.Foreach
Parallel.ForEach(blockingCollection1.GetConsumingPartitioner(),
data => {
var processedData = ProcessData(data);
blockingCollection2.Add(processedData);
});
(note that you need to use GetConsumingPartitioner not GetConsumingEnumerable (see here)
Related
I am writing a file-processing program in C#. I have a HUGE text file, with 5 columns of data each separated by a bar(|). The first column in each row is a column containing a person's name, and each person has a unique name.
Its a very large text file, so I want to process it concurrently using multiple tasks. But I want every row with the same name to be processed by the SAME task, not a different task. For example, if (part of) my file reads:
Jason|BMW|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|Acura|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|Hyundai|392|17|1/1/2000|1:06
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Dodge|335|18|8/24/2005|7:18
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Infiniti|335|18|8/24/2005|7:14
Then I want one task processing ALL the Jason rows, and another task processing ALL the Mike rows. I don't want the first task processing any Mike rows, and conversely I don't want the second task processing any Jason rows. Essentially, how can I make it so that all rows of a certain name are all processed by the SAME task? ALSO, how will I know when all the processing of all the rows has been completed? I've been racking my tiny brain and I can't come up with a solution.
One idea is to implement the producer-consumer pattern, with one producer that reads the file line-by-line, and multiple consumers that process the lines, one consumer per name. Since the number of unique names may be large, it would be impractical to dedicate a Thread for each consumer, so the consumers should process the data asynchronously. Each consumer should have its own private queue with data to process. The most efficient asynchronous queue currently available in .NET is the Channel<T> class, and using it as a building block would be a good idea, but I will suggest something higher-level that this: an ActionBlock<T> from the TPL Dataflow library. This component combines a processor and a queue, is async-enabled, and is highly configurable. So it will make for a succinct, quite readable, and hopefully quite efficient solution:
var processors = new Dictionary<string, ActionBlock<string>>();
foreach (var line in File.ReadLines(filePath))
{
string name = ExtractName(line); // Reads the first part of the line
if (!processors.TryGetValue(name, out ActionBlock<string> processor))
{
processor = CreateProcessor(name);
processors.Add(name, processor);
}
var accepted = processor.Post(line);
if (!accepted) break; // The processor has failed
}
// Signal that no more lines will be sent to the processors
foreach (var processor in processors.Values) processor.Complete();
// Aggregate the completion of all processors
Task allCompletions = Task.WhenAll(processors.Values.Select(p => p.Completion));
// Wait for the completion of all processors, and allow errors to propagate
allCompletions.Wait(); // or await allCompletions;
static ActionBlock<string> CreateProcessor(string name)
{
return new ActionBlock<string>((string line) =>
{
// Process the line
}, new ExecutionDataflowBlockOptions()
{
// Configure the options if the defaults are not optimal
});
}
I'd go for a concurrent dictionary of concurrent queues, keyed by name.
In the main thread (call it the reader), loop line by line enqueueing the lines to the appropriate concurrent queue (call these the worker queues), with creation of the a new worker queue and dedicated task as needed when a new name is encountered.
It would look something like this (note: this is semi-pseudo code and semi-real code and has no error checking, so treat it as a base for a solution, not the solution).
class FileProcessor
{
private ConcurrentDictionary<string, Worker> workers = new ConcurrentDictionary<string, Worker>();
class Worker
{
public Worker() => Task = Task.Run(Process);
private void Process()
{
foreach (var row in Queue.GetConsumingEnumerable())
{
if (row.Length == 0) break;
ProcessRow(row);
}
}
private void ProcessRow(string[] row)
{
// your implementation here
}
public Task Task { get; }
public BlockingCollection<string[]> Queue { get; } = new BlockingCollection<string[]>(new ConcurrentQueue<string[]>());
}
void ProcessFile(string fileName)
{
foreach (var line in GetLinesOfFile(fileName))
{
var row = line.Split('|');
var name = row[0];
// create worker as needed
var worker = workers.GetOrAdd(name, x => new Worker());
// add a row for the worker to work on
worker.Queue.Add(row);
}
// send an empty array to each worker to signal end of input
foreach (var worker in workers.Values)
worker.Queue.Add(new string[0]);
// now wait for all workers to be done
Task.WaitAll(workers.Values.Select(x => x.Task).ToArray());
}
private static IEnumerable<string> GetLinesOfFile(string fileName)
{
// this helps limit memory consumption by not loading
// the whole file at once
return File.ReadLines(fileName);
}
}
I suggest that your reader thread stream the file rather than reading the entire file; you stated the file was huge, so streaming would be memory friendly). That reader thread is I/O bound, so if you can async/await it, that would be better than my simple Process() doing a foreach with no awaiting.
The features of this approach:
dedicated task per person's name
use of a sentinel value to signal end of input
use of Task.WaitAll to join back to the main thread
assumes the tasks are CPU bound. If they are I/O bound, consider using async/await and Task.WhenAll instead
file is streamed into memory with File.ReadLines()
names do not need to be sorted because the queue to enqueue to is selected by name on-demand
Refinements
In the interest of completeness, the approach above is a bit naive and can be refined by... reading all of the comments and answers; users Zoulias and Mercer in particular have good points. We can refine this approach with
adapt this to TPL Channels and use CompleteAdding. These are not only better abstractions, but more efficient (abstraction and efficient can often be at odds, but not in this case).
reduce the name-to-thread or name-to-task dedication, which can exhaust resources in the case of a large number of names, and instead map names to buckets or partitions where each bucket/partition has a dedicated task/thread.
For the second point, for example, you could have
// create worker as needed
var worker = workers.GetOrAdd(GetPartitionKey(name), x => new Worker());
where GetPartitionKey() could be implemented something like
private string GetPartitionKey(string name) =>
name[0] switch
{
>= 'a' and <= 'f' => "A thru F bucket",
>= 'A' and <= 'F' => "A thru F bucket",
>= 'g' and <= 'k' => "G thru K bucket",
>= 'G' and <= 'K' => "G thru K bucket",
_ => "everything else bucket"
}
or whatever algorithm you want to use as a partition selector.
how can I make it so that all rows of a certain name are all processed by the SAME task?
A System.Threading.Task can be created using various TaskCreationOptions that dictate how and when their threads and resources are managed during their lifetime. For an operation for consuming large amount of data and furthermore segregating the consumption of data to specific threads - you may want to consider creating the tasks that are responsible for individual names with the option TaskCreationOptions.LongRunning which may provide a hint to the task scheduler that an additional thread might be required for the task so that it does not block the forward progress of other threads or work items on the local thread-pool queue.
For the actual how, I would recommend starting various 'Worker' threads, each with their own Task and a way for your main task (the one reading the file, or parsing the JSON data) to communicate between the two that more work needs to be completed.
Consider the use of thread-safe collections such as a ConcurrentQueue<T> or other various collections that may help you in streaming data between threads for consumption safely.
Here's a very limited example of the structure you may want to consider:
void Worker(ConcurrentQueue<string> Queue, CancellationToken Token)
{
// keep the worker in a
while (Token.IsCancellationRequested is false)
{
// check to see if the queue has stuff, and consume it
if (Queue.TryDequeue(out string line))
{
Console.WriteLine($"Consumed Line {line} {Thread.CurrentThread.ManagedThreadId}");
}
// yield the thread incase other threads have work to do
Thread.Sleep(10);
}
Console.WriteLine("Finished Work");
}
// data could be a reader, list, array anything really
IEnumerable<string> Data()
{
yield return "JASON";
yield return "Mike";
yield return "JASON";
yield return "Mike";
}
void Reader()
{
// create some collections to stream the data to other tasks
ConcurrentQueue<string> Jason = new();
ConcurrentQueue<string> Mike = new();
// make sure we have a way to cancel the workers if we need to
CancellationTokenSource tokenSource = new();
// start some worker tasks that will consume the data
Task[] workers = {
new Task(()=> Worker(Jason, tokenSource.Token), TaskCreationOptions.LongRunning),
new Task(()=> Worker(Mike, tokenSource.Token), TaskCreationOptions.LongRunning)
};
for (int i = 0; i < workers.Length; i++)
{
workers[i].Start();
}
// iterate the data and send it off to the queues for consumption
foreach (string line in Data())
{
switch (line)
{
case "JASON":
Console.WriteLine($"Sent line to JASON {Thread.CurrentThread.ManagedThreadId}");
Jason.Enqueue(line);
break;
case "Mike":
Console.WriteLine($"Sent line to Mike {Thread.CurrentThread.ManagedThreadId}");
Mike.Enqueue(line);
break;
default:
Console.WriteLine($"Disposed unknown line {Thread.CurrentThread.ManagedThreadId}");
break;
}
}
// make sure that worker threads are cancelled if parent task has been cancelled
try
{
// wait for workers to finish by checking collections
do
{
Thread.Sleep(10);
} while (Jason.IsEmpty is false && Mike.IsEmpty is false);
}
finally
{
// cancel the worker threads, if they havent already
tokenSource.Cancel();
}
}
// make sure we have a way to cancel the reader if we need to
CancellationTokenSource tokenSource = new();
// start the reader thread
Task[] tasks = { Task.Run(Reader, tokenSource.Token) };
Console.WriteLine("Starting Reader");
Task.WaitAll(tasks);
Console.WriteLine("Finished Reader");
// cleanup the tasks if they are still running some how
tokenSource?.Cancel();
// dispose of IDisposable Object
tokenSource?.Dispose();
Console.ReadLine();
I have a C# requirement for individually processing a 'great many' (perhaps > 100,000) records. Running this process sequentially is proving to be very slow with each record taking a good second or so to complete (with a timeout error set at 5 seconds).
I would like to try running these tasks asynchronously by using a set number of worker 'threads' (I use the term 'thread' here cautiously as I am not sure if I should be looking at a thread, or a task or something else).
I have looked at the ThreadPool, but I can't imagine it could queue the volume of requests required. My ideal pseudo code would look something like this...
public void ProcessRecords() {
SetMaxNumberOfThreads(20);
MyRecord rec;
while ((rec = GetNextRecord()) != null) {
var task = WaitForNextAvailableThreadFromPool(ProcessRecord(rec));
task.Start()
}
}
I will also need a mechanism that the processing method can report back to the parent/calling class.
Can anyone point me in the right direction with perhaps some example code?
A possible simple solution would be to use a TPL Dataflow block which is a higher abstraction over the TPL with configurations for degree of parallelism and so forth. You simply create the block (ActionBlock in this case), Post everything to it, wait asynchronously for completion and TPL Dataflow handles all the rest for you:
var block = new ActionBlock<MyRecord>(
rec => ProcessRecord(rec),
new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = 20});
MyRecord rec;
while ((rec = GetNextRecord()) != null)
{
block.Post(rec);
}
block.Complete();
await block.Completion
Another benefit is that the block starts working as soon as the first record arrives and not only when all the records have been received.
If you need to report back on each record you can use a TransformBlock to do the actual processing and link an ActionBlock to it that does the updates:
var transform = new TransfromBlock<MyRecord, Report>(rec =>
{
ProcessRecord(rec);
return GenerateReport(rec);
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = 20});
var reporter = new ActionBlock<Report>(report =>
{
RaiseEvent(report) // Or any other mechanism...
});
transform.LinkTo(reporter, new DataflowLinkOptions { PropagateCompletion = true });
MyRecord rec;
while ((rec = GetNextRecord()) != null)
{
transform.Post(rec);
}
transform.Complete();
await transform.Completion
Have you thought about using parallel processing with Actions?
ie, create a method to process a single record, add each record method as an action into a list, and then perform a parrallel.for on the list.
Dim list As New List(Of Action)
list.Add(New Action(Sub() MyMethod(myParameter)))
Parallel.ForEach(list, Sub(t) t.Invoke())
This is in vb.net, but I think you get the gist.
At work one of our processes uses a SQL database table as a queue. I've been designing a queue reader to check the table for queued work, update the row status when work starts, and delete the row when the work is finished. I'm using Parallel.Foreach to give each process its own thread and setting MaxDegreeOfParallelism to 4.
When the queue reader starts up it checks for any unfinished work and loads the work into an list, then it does a Concat with that list and a method that returns an IEnumerable which runs in an infinite loop checking for new work to do. The idea is that the unfinished work should be processed first and then the new work can be worked as threads are available. However what I'm seeing is that FetchQueuedWork will change dozens of rows in the queue table to 'Processing' immediately but only work on a few items at a time.
What I expected to happen was that FetchQueuedWork would only get new work and update the table when a slot opened up in the Parallel.Foreach. What's really odd to me is that it behaves exactly as I would expect when I run the code in my local developer environment, but in production I get the above problem.
I'm using .Net 4. Here is the code:
public void Go()
{
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = 4 }, DoWork);
}
private IEnumerable<WorkData> FetchQueuedWork()
{
while (true)
{
var workUnit = WorkData.GetQueuedWorkAndSetStatusToProcessing();
yield return workUnit;
}
}
private void DoWork(WorkData workUnit)
{
if (!workUnit.Loaded)
{
System.Threading.Thread.Sleep(5000);
return;
}
Work();
}
I suspect that the default (Release mode?) behaviour is to buffer the input. You might need to create your own partitioner and pass it the NoBuffering option:
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork.Concat(FetchQueuedWork());
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
var partitioner = Partitioner.Create(work, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partioner, options, DoWork);
Blorgbeard's solution is correct when it comes to .NET 4.5 - hands down.
If you are constrained to .NET 4, you have a few options:
Replace your Parallel.ForEach with work.AsParallel().WithDegreeOfParallelism(4).ForAll(DoWork). PLINQ is more conservative when it comes to buffering items, so this should do the trick.
Write your own enumerable partitioner (good luck).
Create a grotty semaphore-based hack such as this:
(Side-effecting Select used for the sake of brevity)
public void Go()
{
const int MAX_DEGREE_PARALLELISM = 4;
using (var semaphore = new SemaphoreSlim(MAX_DEGREE_PARALLELISM, MAX_DEGREE_PARALLELISM))
{
List<WorkData> unfinishedWork = WorkData.LoadUnfinishedWork();
IEnumerable<WorkData> work = unfinishedWork
.Concat(FetchQueuedWork())
.Select(w =>
{
// Side-effect: bad practice, but easier
// than writing your own IEnumerable.
semaphore.Wait();
return w;
});
// You still need to specify MaxDegreeOfParallelism
// here so as not to saturate your thread pool when
// Parallel.ForEach's load balancer kicks in.
Parallel.ForEach(work, new ParallelOptions { MaxDegreeOfParallelism = MAX_DEGREE_PARALLELISM }, workUnit =>
{
try
{
this.DoWork(workUnit);
}
finally
{
semaphore.Release();
}
});
}
}
i've a methode that return a IEnumerable of my business object. In this method i parse the content of a large text file to the business object model. There is no threading stuff in it.
In my ViewModel (WPF) i need to store and display the results of the method.
Store is an ObservableCollection.
Here ist the observable code:
private void OpenFile(string file)
{
_parser = new IhvParser();
App.Messenger.NotifyColleagues(Actions.ReportContentInfo, new Model.StatusInfoDisplayDTO { Information = "Lade Daten...", Interval = 0 });
_ihvDataList.Clear();
var obs = _parser.ParseDataObservable(file)
.ToObservable(NewThreadScheduler.Default)
.ObserveOnDispatcher()
.Subscribe<Ihv>(AddIhvToList, ReportError, ReportComplete);
}
private void ReportComplete()
{
App.Messenger.NotifyColleagues(Actions.ReportContentInfo, new Model.StatusInfoDisplayDTO { Information = "Daten fertig geladen.", Interval = 3000 });
RaisePropertyChanged(() => IhvDataList);
}
private void ReportError(Exception ex)
{
MessageBox.Show("...");
}
private void AddIhvToList(Ihv ihv)
{
_ihvDataList.Add(ihv);
}
And this is the parser code:
public IEnumerable<Model.Ihv> ParseDataObservable(string file)
{
using (StreamReader reader = new StreamReader(file))
{
var head = reader.ReadLine(); //erste Zeile ist Kopfinformation
if (!head.Contains("BayBAS") || !head.Contains("2.3.0"))
{
_logger.ErrorFormat("Die Datei {0} liegt nicht im BayBAS-Format 2.3.0 vor.");
}
else
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
if (line.Length != 1415)
{
_logger.ErrorFormat("Die Datei {0} liegt nicht im BayBAS-Format 2.3.0 vor.");
break;
}
var tempIhvItem = Model.Ihv.Parse(line);
yield return tempIhvItem;
}
reader.Close();
}
}
}
Why do i don't get the results async? Before i see the results in my DataGrid, all items are parsed and delivered.
Can anybody help?
Andreas
Are you sure this isn't happening asynchronously? Are you assuming this based on what you perceive in the UI, or have you set breakpoints and determined that this is, in fact, the case?
Note that WPF's Dispatcher uses a priority queue, and DispatcherScheduler schedules items with Normal priority, which trumps the priority levels used for input, layout, and rendering. If the results come in quickly enough, then the UI may not get updated until after the last result has been processed: the dispatcher might be too busy processing results to perform layout and rendering of the UI.
You could try overriding the behavior of the DispatcherScheduler to schedule at a custom priority like so:
public class PriorityDispatcherScheduler : DispatcherScheduler
{
private readonly DispatcherPriority _priority;
public PriorityDispatcherScheduler(DispatcherPriority priority)
: this(priority, Dispatcher.CurrentDispatcher) {}
public PriorityDispatcherScheduler(DispatcherPriority priority, Dispatcher dispatcher)
: base(dispatcher)
{
_priority = priority;
}
public override IDisposable Schedule<TState>(TState state, Func<IScheduler, TState, IDisposable> action)
{
if (action == null)
throw new ArgumentNullException("action");
var d = new SingleAssignmentDisposable();
this.Dispatcher.BeginInvoke(
_priority,
(Action)(() =>
{
if (d.IsDisposed)
return;
d.Disposable = action(this, state);
}));
return d;
}
}
And then modify your observable sequence by replacing ObserveOnDispatcher() with ObserveOn(new PriorityDispatcherScheduler(p)), where p is an appropriate priority level (e.g., Background).
Also, this looks highly suspect: ToObservable(NewThreadScheduler.Default). I believe this will cause a new thread to be created every time a result comes in, for the sole purpose of passing it to the dispatcher, after which the new thread will terminate. This is almost certainly not what you intended. I assume you simply wanted the file processed on a separate thread; as written, your code would literally end up creating 1,000 short-lived threads if your IEnumerable yields 1,000 items, none of which would actually be doing the work of reading the file.
Lastly, is OpenFile() being invoked on the dispatcher thread? If so, I believe what is going to happen is as follows:
Dispatcher (on UI thread) will call Subscribe(), which will process the chain of observable operators all the way back to ParseDataObservable(file).
Dispatcher will iterate through your IEnumerable sequence, firing each result into the observable sequence created by ToObservable().
Each result passed into the observable sequence will be scheduled for delivery on the dispatcher (the very same dispatcher that is currently running).
If this is the case, then the entire file will be read before any of the results get passed to AddIhvToList(), because the dispatcher is tied up reading the file and won't get around to processing the results in its queue until it has finished. If this is what is happening, you can try altering your code as follows:
var obs = _parser.ParseDataObservable(file)
.ToObservable()
.SubscribeOn(/*NewThread*/Scheduler.Default)
.ObserveOnDispatcher() // consider using PriorityDispatcherScheduler
.Subscribe<Ihv>(AddIhvToList, ReportError, ReportComplete);
Injecting SubscribeOn() should ensure that the iteration of your IEnumerable (i.e., the reading of the file) occurs on a separate thread. Scheduler.Default should suffice here, but you could use a NewThreadScheduler if you really need to (you probably don't). The dispatcher thread will return from Subscribe() after everything has been set up, freeing it up to continue processing its queue, i.e., passing the results to AddIhvToList() as they come in. This should give you the asynchronous behavior you desire.
I'm new to RX.
I'd like to traverse an IEnumerable and publish to multi DataHandlers that process the data in their respective threads.
Below is my sample program. The publish works and a new thread is created, but the 3 RowHandlers are all running in 1 thread. I need 3 threads. What is the best way to implement this?
class Program
{
public class MyDataGenerator
{
public IEnumerable<int> myData()
{
//Heavy lifting....Don't want to process more than once.
yield return 1;
yield return 2;
yield return 3;
yield return 4;
yield return 5;
yield return 6;
}
}
static void Main(string[] args)
{
MyDataGenerator h = new MyDataGenerator();
Console.WriteLine("Thread id " + Thread.CurrentThread.ManagedThreadId.ToString());
//
var shared = h.myData().ToObservable().Publish();
///////////////////////////////
// Row Handling Requirements
//
// 1. Single Scan of IEnumerable.
// 2. Row handlers process data in their own threads.
// 3. OK if scanning thread blocks while data is processed
//
//Create the RowHandlers
MyRowHandler rn1 = new MyRowHandler();
rn1.ido = shared.Subscribe(i => rn1.processID(i));
MyRowHandler rn2 = new MyRowHandler();
rn2.ido = shared.Subscribe(i => rn2.processID(i));
MyRowHandler rn3 = new MyRowHandler();
rn3.ido = shared.Subscribe(i => rn3.processID(i));
//
shared.Connect();
}
public class MyRowHandler
{
public IDisposable ido = null;
public void processID(int i)
{
var o = Observable.Start(() =>
{
Console.WriteLine(String.Format("Start Thread ID {0} Int{1}", Thread.CurrentThread.ManagedThreadId, i));
Thread.Sleep(30);
Console.WriteLine("Done Thread ID"+Thread.CurrentThread.ManagedThreadId.ToString());
}
);
o.First();
}
}
}
Discovery :
The coding speed & code quality gains one receives from Rx come at the expense of performance. Task/Delegates are without a doubt multiples faster. That means that the most important thing one needs to learn about Rx is when to use Rx. Below is a draft summary guideline. For large volumes I can see use for Rx in chuncking, combining, and other many stream-many handler models; however, basic Async should not use rx.
I'd post an image with a matrix guideline, but the site won't let me post images
If I understand your sequencing requirements correctly and you want three parallel running scans, you can just observe on the TaskPool and subscribe from there;
...
//Create the RowHandlers
MyRowHandler rn1 = new MyRowHandler();
rn1.ido = shared.ObserveOn(Scheduler.TaskPool).Subscribe(i => rn1.processID(i));
...
Note that since you're then running asynchronously and your main thread doesn't wait for the scans to get done, your program will terminate right away unless you for example put a Console.ReadKey() at the end of the program.
EDIT: Regarding running the same thread "all the way", you're scheduling a bit strangely for that. If you drop the observable in the rowhandler, you can use Scheduler.NewThread and get good results;
...
var rowHandler1 = new MyRowHandler();
rowHandler1.ido = shared.ObserveOn(Scheduler.NewThread).Subscribe(rowHandler1.ProcessID);
...
public void ProcessID(int i)
{
Console.WriteLine(String.Format("Start Thread ID {0} Int{1}", Thread.CurrentThread.ManagedThreadId, i));
Thread.Sleep(30);
Console.WriteLine("Done Thread ID" + Thread.CurrentThread.ManagedThreadId.ToString(CultureInfo.InvariantCulture));
}
That will give each subscription its own thread, and stay with it.