Errors in using Parallel.ForEach - C#

Originally, I was using foreach to do my task.
However, I would like to improve the efficiency of the task, so I want to use Parallel.ForEach instead.
However, I now get the error "Object reference not set to an instance of an object".
Here's my code:
System.Threading.Tasks.Parallel.ForEach(items, item =>
{
    System.Threading.Tasks.Parallel.ForEach(item.a, amount =>
    {
        WriteToCsv(file, amount.columna, count);
        count++;
    });
});
If I use foreach (var item in items) and foreach (var amount in item.a), the code works fine.
Did I miss something for the Parallel.ForEach method?

Why would you want to do this in parallel? You are appending values to a file, and that is a serial (i.e., non-parallel) process.
If you just use a regular foreach, you'll be fine.
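For reference, here is the sequential version this refers to, a minimal sketch reusing items, file, count and WriteToCsv from the question:
foreach (var item in items)
{
    foreach (var amount in item.a)
    {
        WriteToCsv(file, amount.columna, count);
        count++;
    }
}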

You shouldn't nest multiple Parallel.ForEach loops; the inner loops are basically stealing resources from the outer parallel loop, which is doing a much better job of improving performance.
Only one thread can write to a file at a time anyway, so presumably whichever functions you're using inside WriteToCsv() are marshalling everything back onto a single thread before the data is written.
One way to speed this up would be to construct the text to be appended to the CSV file in a parallel loop, then use a single write operation to add it to the file:
var sb = new StringBuilder();
var sync = new object();
System.Threading.Tasks.Parallel.ForEach(items, item =>
{
    foreach (var amount in item.a)
    {
        // Take a unique counter value without locking, then build the row text in parallel.
        int current = Interlocked.Increment(ref count) - 1;
        string text = GetCsvText(file, amount.columna, current);
        lock (sync)
        {
            sb.Append(text); // StringBuilder is not thread safe, so guard the shared instance
        }
    }
});
WriteToCsv(sb.ToString());
Obviously you can't control the order in which text is put into the CSV file, because it's just a race between threads in this code; I assume this is OK, though, as the same is true of your sample code.
N.B.: using StringBuilder is important for memory performance when constructing strings in loops.
HTH

Related

Can PLINQ generate two streams? An error stream and a data stream

I have a large (700MB+) CSV file that I am processing with PLINQ. Here's the query:
var q = from r in ReadRow(src).AsParallel()
        where BoolParser.Parse(r[vacancyIdx])
        select r[apnIdx];
It generates a list of APNs for vacant properties, if you are wondering.
My question is: how can I extract a stream of "bad records" without doing two passes over the query/stream?
Each line in the CSV file should contain colCount records. I would like to enforce this by changing the where clause to "where r.Count == colCount && BoolParser.Parse(r[vacancyIdx])".
But then any malformed input is going to silently disappear.
I need to capture any malformed lines in an error log and flag that n lines of input were not processed.
Currently I do this work in the ReadRow() function, but it seems like there ought to be a PLINQ-y way to split a stream of data into two or more streams to be processed.
Anyone out there know how to do this? If not, does anyone know how to get this suggestion added to the PLINQ new feature requests? ;-)
What you're asking for doesn't make much sense, because PLINQ is based on a "pull" model (i.e. the consumer decides when to consume an item). Consider code like (using C# 7 tuple syntax for brevity):
var (good, bad) = ReadRow(src).AsParallel().Split(r => r.Count == colCount);
foreach (var item in bad)
{
    // do something
}

foreach (var item in good)
{
    // do something else
}
The implementation of Split has two options:
1) Block one stream when the current item belongs to the other stream. In the above example, this would cause a deadlock as soon as the first good item appeared.
2) Cache the values of one stream while the other stream is being read. In the above example, assuming the vast majority of items are good, this would keep about 700 MB of your data in memory between the two foreach loops. So this is also undesirable.
So, I think your solution of doing it in ReadRow is okay.
Another option would be something like:
where CheckCount(r) && BoolParser.Parse(r[vacancyIdx])
Here, the CheckCount method reports any errors it finds and returns false for them. (If you do this, make sure to make the reporting thread-safe.)
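A minimal sketch of what that CheckCount could look like, assuming each row r is an IList&lt;string&gt; and colCount is accessible as a field; a ConcurrentQueue (System.Collections.Concurrent) keeps the error reporting thread-safe:
private static readonly ConcurrentQueue<string> badRecords = new ConcurrentQueue<string>();

private static bool CheckCount(IList<string> r)
{
    if (r.Count == colCount)
        return true;

    // Capture the malformed line; ConcurrentQueue is safe to call from PLINQ worker threads.
    badRecords.Enqueue(string.Join(",", r));
    return false;
}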
If you still want to propose adding something like this to PLINQ, or just discuss the options, you can create an issue in the corefx repository.

Is there a faster way to print to a TextBox than AppendText?

I have a background thread that reads messages from a device, formats them, and passes them to the 'consumer' side of a producer/consumer collection, which then prints all of the messages to the screen. The problem I am running into is that the logging lags a little behind the end of the process, so I am trying to find a faster way to print to the screen. Here is my consumer method:
private void DisplayMessages(BlockingCollection<string[]> messages)
{
    try
    {
        foreach (var item in messages.GetConsumingEnumerable(_cancellationTokenSource.Token))
        {
            this.Invoke((MethodInvoker)delegate
            {
                outputTextBox.AppendText(String.Join(Environment.NewLine, item) + Environment.NewLine);
            });
        }
    }
    catch (OperationCanceledException)
    {
        //TODO:
    }
}
I have done some benchmark tests on my producer methods, and even logging to the console, and it does appear that writing to this TextBox is the slow part. During each process, I am logging ~61,000 lines that are about 60 characters long.
I have read that it is better to use .AppendText() than textBox.Text += newText, as the latter resets the entire TextBox's text. I am looking for a solution that might include a faster way to print to a TextBox (or a UI element better suited to quick logging?), or whether using String.Join(Environment.NewLine, item) is inefficient and could be sped up in any way.
Code with O(n^2) performance is expected to be slow: as the text box grows, every update has to deal with all of the text that is already there, so the total work grows quadratically with the number of lines.
If you can't use String.Join to construct the whole output at once, then consider using list-oriented controls or even grids. If you have a very large number of rows, most list controls and grids have a "virtual" mode where text is only requested on demand.
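As a hedged illustration of that "virtual" mode, assuming a WinForms ListView named listView1 and the rows collected in a List&lt;string&gt; called allLines (both names are assumptions, not from the question):
listView1.VirtualMode = true;
listView1.RetrieveVirtualItem += (sender, e) =>
{
    // The control asks for a row only when it scrolls into view.
    e.Item = new ListViewItem(allLines[e.ItemIndex]);
};
listView1.VirtualListSize = allLines.Count;   // total number of rows available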
String.Join isn't inefficient. The way it is called, though, causes it to be called N times (once for each message array in the collection) instead of just once. You could avoid this by flattening the collection with SelectMany, which takes a sequence of sequences as input and returns each individual item in them:
var allLines = messages.SelectMany(msg => msg);
var bigText = String.Join(Environment.NewLine, allLines);
Working with such a big text box though will be very difficult. You should consider using a virtualized grid or listbox, adding filters and search functionality to make it easier for users to find the messages they want.
Another option may be to simply assign the lines to the TextBox.Lines property without creating a string:
var allLines = messages.SelectMany(msg => msg);
outputTextBox.Lines = allLines.ToArray();

Can I convert while(true) loop to EventWaitHandle?

I'm trying to process a large number of text files via Parallel.ForEach, adding the processed data to a BlockingCollection.
The problem is that I want the Task taskWriteMergedFile to consume the collection and write it to the result file at least every 800,000 lines.
I guess that I can't test the collection size within the iteration because it is parallelized, so I created the Task.
Can I convert the while(true) loop in the task to an EventWaitHandle in this case?
const int MAX_SIZE = 1000000;
static BlockingCollection<string> mergeData;
mergeData = new BlockingCollection<string>(new ConcurrentBag<string>(), MAX_SIZE);

string[] FilePaths = Directory.GetFiles("somepath");

var taskWriteMergedFile = new Task(() =>
{
    while (true)
    {
        if (mergeData.Count > 800000)
        {
            String.Join(System.Environment.NewLine, mergeData.GetConsumingEnumerable());
            //Write to file
        }
        Thread.Sleep(10000);
    }
}, TaskCreationOptions.LongRunning);

taskWriteMergedFile.Start();

Parallel.ForEach(FilePaths, FilePath => AddToDataPool(FilePath));
mergeData.CompleteAdding();
You probably don't want to do it that way. Instead, have your task write each line to the file as it's received. If you want to limit the file size to 800,000 lines, then after the 800,000th line is written, close the current file and open a new one.
Come to think of it, what you have can't work because GetConsumingEnumerable() won't stop until the collection is marked as complete for adding. What would happen is that the task would spin through the sleep loop until there were 800,000 items in the queue, and then it would block on the String.Join until the main thread calls CompleteAdding. With enough data, you'd run out of memory.
Also, unless you have a very good reason, you shouldn't use ConcurrentBag here. Just use the default for BlockingCollection, which is ConcurrentQueue. ConcurrentBag is a rather special purpose data structure that won't perform as well as ConcurrentQueue.
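A minimal sketch of that change, keeping the question's MAX_SIZE bound:
// Omitting the collection argument makes BlockingCollection use a ConcurrentQueue internally.
mergeData = new BlockingCollection<string>(MAX_SIZE);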
So your task becomes:
var taskWriteMergedFile = new Task(() =>
{
    int recordCount = 0;
    // outputFile is assumed to have been opened elsewhere (see below).
    foreach (var line in mergeData.GetConsumingEnumerable())
    {
        outputFile.WriteLine(line);
        ++recordCount;
        if (recordCount == 800000)
        {
            // If you want to do something after 800,000 lines, do it here
            // and then reset the record count.
            recordCount = 0;
        }
    }
}, TaskCreationOptions.LongRunning);
That assumes, of course, that you've opened the output file somewhere else. It's probably better to open the output at the start of the task, and close it after the foreach has exited.
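A hedged sketch of the close-and-reopen idea described above; GetNextFileName() is a hypothetical helper that returns output_001.txt, output_002.txt, and so on:
var taskWriteMergedFile = new Task(() =>
{
    int recordCount = 0;
    var outputFile = new StreamWriter(GetNextFileName());    // hypothetical helper
    foreach (var line in mergeData.GetConsumingEnumerable())
    {
        outputFile.WriteLine(line);
        if (++recordCount == 800000)
        {
            outputFile.Dispose();                             // close the full file
            outputFile = new StreamWriter(GetNextFileName()); // roll to a new one
            recordCount = 0;
        }
    }
    outputFile.Dispose();
}, TaskCreationOptions.LongRunning);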
On another note, you probably don't want your producer loop to be parallel. You have:
Parallel.ForEach(FilePaths, FilePath => AddToDataPool(FilePath));
I don't know for sure what AddToDataPool is doing, but if it's reading a file and writing the data to the collection, you have a couple of problems. First, the disk drive can only do one thing at a time, so it ends up reading part of one file, then part of another, then part of another, etc. In order to read each chunk of the next file, it has to seek the head to the proper position. Disk head seeks are incredibly expensive--5 milliseconds or more. An eternity in CPU time. Unless you're doing heavy duty processing that takes much longer than reading the file, you're almost always better off processing one file at a time. Unless you can guarantee that the input files are on separate physical disks . . .
The second potential problem is that with multiple threads running, you can't guarantee the order in which things are written to the collection. That might not be a problem, of course, but if you expect all of the data from a single file to be grouped together in the output, that's not going to happen with multiple threads each writing multiple lines to the collection.
Just something to keep in mind.
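For illustration, a minimal sketch of the sequential producer, reusing FilePaths, AddToDataPool and mergeData from the question:
foreach (var FilePath in FilePaths)
{
    AddToDataPool(FilePath);   // read one file at a time; the writer task drains mergeData concurrently
}
mergeData.CompleteAdding();    // tell GetConsumingEnumerable() that no more lines are coming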

C# Parallel.ForEach - Making variables thread safe

I have been rewriting some process-intensive looping to use TPL to increase speed. This is the first time I have tried threading, so I want to check whether what I am doing is the correct way to go about it.
The results are good: processing the data from 1,000 rows in a DataTable has reduced processing time from 34 minutes to 9 minutes when moving from a standard foreach loop to a Parallel.ForEach loop. For this test, I removed non-thread-safe operations, such as writing data to a log file and incrementing a counter.
I still need to write back to a log file and increment a counter, so I tried implementing a lock that encloses the StreamWriter/increment code block.
FileStream filestream = new FileStream("path_to_file.txt", FileMode.Create);
StreamWriter streamwriter = new StreamWriter(filestream);
streamwriter.AutoFlush = true;

try
{
    object locker = new object();

    // Let's assume we have a DataTable containing 1000 rows of data.
    DataTable datatable_results;

    if (datatable_results.Rows.Count > 0)
    {
        int row_counter = 0;

        Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
        {
            // Process data_row as normal.
            // When ready to write to log, do so.
            lock (locker)
            {
                row_counter++;
                streamwriter.WriteLine("Processing row: {0}", row_counter);
                // Write any data we want to log.
            }
        });
    }
}
catch (Exception e)
{
    // Catch the exception.
}

streamwriter.Close();
The above seems to work as expected, with minimal performance cost (still 9 minutes execution time). Granted, the actions contained in the lock are hardly significant in themselves; I assume that as the time taken to run the code inside the lock increases, threads are blocked for longer and processing time is affected more.
My question: is the above an efficient way of doing this, or is there a different way of achieving it that is either faster or safer?
Also, let's say our original DataTable actually contains 30,000 rows. Is there anything to be gained by splitting this DataTable into chunks of 1,000 rows each and then processing them in the Parallel.ForEach, instead of processing all 30,000 rows in one go?
Writing to the file is expensive, and you're holding an exclusive lock while writing to it; that's bad, because it's going to introduce contention.
You could add the output to a buffer instead, then write it to the file all at once. That should remove the contention and give you a way to scale.
if (datatable_results.Rows.Count > 0)
{
    ConcurrentQueue<string> buffer = new ConcurrentQueue<string>();

    Parallel.ForEach(datatable_results.AsEnumerable(), (data_row, state, index) =>
    {
        // Process data_row as normal.
        // When ready to write to log, do so.
        buffer.Enqueue(string.Format("Processing row: {0}", index));
    });

    streamwriter.AutoFlush = false;
    string line;
    while (buffer.TryDequeue(out line))
    {
        streamwriter.WriteLine(line);
    }
    streamwriter.Flush(); // Flush once when needed
}
Note that you don't need to maintain a loop counter; Parallel.ForEach provides you with one. The difference is that it is not a counter but an index. If that changes the expected behavior, you can still add the counter back and use Interlocked.Increment to increment it.
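If you do want a true running counter rather than the loop index, here is a minimal sketch using Interlocked.Increment with the same buffer as above:
int row_counter = 0;
Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
{
    // Process data_row as normal.
    int current = Interlocked.Increment(ref row_counter);   // atomic, no lock required
    buffer.Enqueue(string.Format("Processing row: {0}", current));
});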
I see that you're using streamwriter.AutoFlush = true; that will hurt performance. You can set it to false and flush once you're done writing all the data.
If possible, wrap the StreamWriter in a using statement, so that you don't even need to flush the stream (you get it for free).
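A minimal sketch of that using suggestion; disposing the writer flushes and closes it automatically:
using (var streamwriter = new StreamWriter("path_to_file.txt"))
{
    streamwriter.AutoFlush = false;
    // ... process the rows and write the buffered log lines here ...
}   // the stream is flushed and closed here, even if an exception is thrown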
Alternatively, you could look at logging frameworks, which do this job pretty well. Examples: NLog, log4net, etc.
You may be able to improve this if you avoid logging, or log only to a thread-specific log file (not sure whether that makes sense for you).
TPL starts as many threads as you have cores (see "Does Parallel.ForEach limits the number of active threads?").
So what you can do is:
1) Get the number of cores on the target machine
2) Create a list of counters, with as many elements as you have cores
3) Have each thread update its own counter
4) Sum them all up after the parallel execution terminates
So, in practice:
// KEY: THREAD ID, VALUE: THREAD-LOCAL COUNTER
Dictionary<int, int> counters = new Dictionary<int, int>(NUMBER_OF_CORES);
....
Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
{
    // Process data_row as normal.
    // When ready to write to log, do so.
    //lock (locker) // NO NEED FOR A LOCK, EVERY THREAD UPDATES ITS _OWN_ COUNTER
    //{
    //    row_counter++;
    counters[Thread.CurrentThread.ManagedThreadId] += 1;
    // NO WRITING, OR WRITE TO A THREAD-SPECIFIC FILE ONLY
    //    streamwriter.WriteLine("Processing row: {0}", row_counter);
    //}
});
....
// AFTER THE PARALLEL LOOP COMPLETES, SUM ALL THE COUNTERS TO GET THE TOTAL ACROSS ALL THREADS.
The benefit of this is that no locking is involved at all, which will dramatically improve performance. When you use the .NET concurrent collections, they always use some kind of locking internally.
This is naturally just the basic idea and may not work as expected if you copy-paste it. We are talking about multithreading, which is always a hard topic. But hopefully it gives you some ideas to rely on.
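For a variant of the same per-thread-counter idea that compiles and is safe as written, here is a hedged sketch using the thread-local-state overload of Parallel.ForEach (assuming the same datatable_results as above):
int totalRows = 0;
Parallel.ForEach(
    datatable_results.AsEnumerable(),
    () => 0,                                       // localInit: each worker thread starts its own counter
    (data_row, state, localCount) =>
    {
        // Process data_row as normal.
        return localCount + 1;                     // update only the thread-local counter, no locking
    },
    localCount => Interlocked.Add(ref totalRows, localCount)); // localFinally: combine once per thread
// totalRows now holds the grand total across all threads.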
First of all, it takes about 2 seconds to process a row in your table and perhaps a few milliseconds to increment the counter and write to the log file. With the actual processing being 1000x more than the part you need to serialize, the method doesn't matter too much.
Furthermore, the way you have implemented it is perfectly solid. There are ways to optimize it, but none that are worth implementing in your situation.
One useful way to avoid locking on the increment is to use Interlocked.Increment. It is a bit slower than x++ but much faster than lock {x++;}. In your case, though, it doesn't matter.
As for the file output, remember that the output is going to be serialized anyway, so at best you can minimize the amount of time spent in the lock. You can do this by buffering all of your output before entering the lock, then just perform the write operation inside the lock. You probably want to do async writes to avoid unnecessary blocking on I/O.
You can move the code that builds the log record into a new method. For example:
// Class scope
private string GetLogRecord(int rowCounter, DataRow row)
{
    return string.Format("Processing row: {0}", rowCounter); // Write any data we want to log.
}

//....

Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
{
    // Process data_row as normal.
    // When ready to write to log, do so.
    int currentRow;
    lock (locker)
        currentRow = ++row_counter;          // take the counter value while still inside the lock

    var logRecord = GetLogRecord(currentRow, data_row);

    lock (locker)
        streamwriter.WriteLine(logRecord);
});
This is my code that uses a parallel for. The concept is similar, and perhaps easier for you to implement. FYI, for debugging, I keep a regular for loop in the code and conditionally compile the parallel code. Hope this helps. The value of i in this scenario isn't the same as the number of records processed, however. You could create a counter, use a lock, and add values to that. For my other code, where I do have a counter, I didn't use a lock and just allowed the value to be potentially off, to avoid the slower code. I have a status mechanism to indicate the number of records processed. For my implementation, the slight chance that the count is off is not an issue: at the end of the loop I put out a message saying all the records have been processed.
#if DEBUG
for (int i = 0; i < stend.PBBIBuckets.Count; i++)
{
    //int serverIndex = 0;
#else
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = m_maxThreads;

Parallel.For(0, stend.PBBIBuckets.Count, options, (i) =>
{
#endif
    g1client.Message request;
    DataTable requestTable;

    request = new g1client.Message();
    requestTable = request.GetDataTable();

    requestTable.Columns.AddRange(
        Locations.Columns.Cast<DataColumn>().Select(x => new DataColumn(x.ColumnName, x.DataType)).ToArray());

    FillPBBIRequestTables(requestTable, request, stend.PBBIBuckets[i], stend.BucketLen[i], stend.Hierarchies);
#if DEBUG
}
#else
});
#endif

Do I need to store the list before using foreach?

The MSDN example shows this:
// Open the file to read from.
string[] readText = File.ReadAllLines(path);
foreach (string s in readText)
{
    Console.WriteLine(s);
}
But I'm thinking this:
// Open the file to read from.
foreach (string s in File.ReadAllLines(path))
{
    Console.WriteLine(s);
}
There is no difference between those two code snippets, assuming that readText is never used anywhere else.
Even in the second case, the results of the method call will end up being stored somewhere, even if that location doesn't have some name that you can refer to in your code.
On a side note, if you're going to do nothing but iterate through the lines, you can use ReadLines instead of ReadAllLines to stream the lines of text, rather than eagerly loading the entire file into memory before processing any of the lines. This prevents a long delay before accessing the first line, can provide a substantial speed improvement in the event that you end up exiting the loop before processing all lines (keep in mind that this can happen due to exceptions, in addition to explicitly exiting the loop), and dramatically reduces the memory footprint of the program even if you do end up processing all of the lines.
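A minimal sketch of that streaming variant:
// Lines are read lazily, one at a time, instead of loading the whole file first.
foreach (var s in File.ReadLines(path))
{
    Console.WriteLine(s);
}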
The two code snippets are equivalent.
Better still, if you're paying by character, there is no need to be explicit about the type of s:
foreach (var s in File.ReadAllLines(path))
{
    Console.WriteLine(s);
}
There is no need to store the result in a list if the list is not needed at a later time.
My guess is that the compiler will optimize this anyway when building in Release mode.
However, there is one advantage of the first approach: during debugging, you can use the "Auto" or "Locals" window to inspect the content of the variables, which might be helpful.
The foreach operation is performed on the result of the File.ReadAllLines(path) method, so they are both the same.
