Parallel.For loop freezes - C#

I'm trying to add some information to a DataTable in parallel, but if the loop is too long it freezes or just takes a lot of time, more time than a usual for loop. This is my code for the Parallel.For loop:
Parallel.For(1, linii.Length, index =>
{
    DataRow drRow = dtResult.NewRow();
    alResult = CSVParser(linii[index], txtDelimiter, txtQualifier);
    for (int i = 0; i < alResult.Count; i++)
    {
        drRow[i] = alResult[i];
    }
    dtResult.Rows.Add(drRow);
});
What's wrong? Why does this Parallel.For loop take much more time than a normal one?
Thanks!

You can't mutate a DataTable from 2 different threads; it will error. DataTable makes no attempt to be thread-safe. So: don't do that. Just do this from one thread. Most likely you are limited by IO, so you should just do it on a single thread as a stream. It looks like you're processing text data. You seem to have a string[] for the lines, perhaps from File.ReadAllLines()? Well, that is very bad here:
it forces it all to load into memory
you have to wait for it all to load into memory
CSV is a multi-line format; it is not guaranteed that 1 line == 1 row
What you should do is use something like the CsvReader from CodeProject, but even if you want to just use one line at a time, use a StreamReader:
using (var file = File.OpenText(path)) {
    string line;
    while ((line = file.ReadLine()) != null) {
        // process this line
        alResult = CSVParser(line, txtDelimiter, txtQualifier);
        DataRow drRow = dtResult.NewRow();
        for (int i = 0; i < alResult.Count; i++)
        {
            drRow[i] = alResult[i];
        }
        dtResult.Rows.Add(drRow);
    }
}
This will not be faster using Parallel, so I have not attempted to do so. IO is your bottleneck here. Locking would be an option, but it isn't going to help you massively.
As an unrelated aside, I notice that alResult is not declared inside the loop. That means that in your original code alResult is a captured variable that is shared between all the loop iterations - which means you are already overwriting each row horribly.
Edit: illustration of why Parallel is not relevant for reading 1,000,000 lines from a file:
Approach 1: use ReadAllLines to load the lines, then use Parallel to process them; this costs [fixed time] for the physical file IO, and then we parallelise. The CPU work is minimal, and we've basically spent [fixed time]. However, we've added a lot of threading overhead and memory overhead, and we couldn't even start until all the file was loaded.
Approach 2: use a streaming API; read each one line by line - processing each line and adding it. The cost here is basically again: [fixed time] for the actual IO bandwidth to load the file. But; we now have no threading overhead, no sync conflicts, no huge memory to allocate, and we start filling the table right away.
Approach 3: If you really wanted, a third approach would be a reader/writer queue, with one dedicated thread processing file IO and enqueueing the lines, and a second that does the DataTable. Frankly, it is a lot more moving parts, and the second thread will spend 95% of its time waiting for data from the file; stick to Approach 2!
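If you really did want to experiment with Approach 3, a minimal sketch might look something like the following (illustrative only, reusing path, CSVParser, txtDelimiter, txtQualifier and dtResult from the question; Approach 2 is still the recommendation):
var lines = new BlockingCollection<string>(boundedCapacity: 1000);

// Dedicated reader: pushes raw lines into the bounded queue.
var reader = Task.Factory.StartNew(() =>
{
    using (var file = File.OpenText(path)) {
        string line;
        while ((line = file.ReadLine()) != null) lines.Add(line);
    }
    lines.CompleteAdding();
}, TaskCreationOptions.LongRunning);

// Dedicated consumer: parses each line and fills the DataTable.
var writer = Task.Factory.StartNew(() =>
{
    foreach (var line in lines.GetConsumingEnumerable())
    {
        var fields = CSVParser(line, txtDelimiter, txtQualifier);
        DataRow drRow = dtResult.NewRow();
        for (int i = 0; i < fields.Count; i++) drRow[i] = fields[i];
        dtResult.Rows.Add(drRow);
    }
}, TaskCreationOptions.LongRunning);

Task.WaitAll(reader, writer);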

Parallel.For(1, linii.Length, index =>
{
    // Declare the result locally so iterations don't share a single captured variable.
    var alResult = CSVParser(linii[index], txtDelimiter, txtQualifier);
    lock (dtResult)
    {
        DataRow drRow = dtResult.NewRow();
        for (int i = 0; i < alResult.Count; i++)
        {
            drRow[i] = alResult[i];
        }
        dtResult.Rows.Add(drRow);
    }
});

Related

How to optimize the counting of words and characters in a huge file using multithreading?

I have a very large text file around 1 GB.
I need to count the number of words and characters (non-space characters).
I have written the below code.
string fileName = "abc.txt";
long words = 0;
long characters = 0;
if (File.Exists(fileName))
{
using (StreamReader sr = new StreamReader(fileName))
{
string[] fields = null;
string text = sr.ReadToEnd();
fields = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
foreach (string str in fields)
{
characters += str.Length;
}
words += fields.LongLength;
}
Console.WriteLine("The word count is {0} and character count is {1}", words, characters);
}
Is there any way to make it faster using threads? Someone suggested I use threads so that it will be faster.
I have found one issue in my code: it will fail if the number of words or characters is greater than the long max value.
I have written this code assuming there will be only English characters, but there can be non-English characters as well.
I am especially looking for thread-related suggestions.
Here is how you could tackle the problem of counting the non-whitespace characters of a huge text file efficiently, using parallelism. First we need a way to read blocks of characters in a streaming fashion. The native File.ReadLines method doesn't cut it, since the whole file might consist of a single line. Below is a method that uses the StreamReader.ReadBlock method to grab blocks of characters of a specific size and return them as an IEnumerable<char[]>.
public static IEnumerable<char[]> ReadCharBlocks(String path, int blockSize)
{
    using (var reader = new StreamReader(path))
    {
        while (true)
        {
            var block = new char[blockSize];
            var count = reader.ReadBlock(block, 0, block.Length);
            if (count == 0) break;
            if (count < block.Length) Array.Resize(ref block, count);
            yield return block;
        }
    }
}
With this method in place, it is then quite easy to parallelize the parsing of the characters blocks using PLINQ:
public static long GetNonWhiteSpaceCharsCount(string filePath)
{
    return Partitioner
        .Create(ReadCharBlocks(filePath, 10000), EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .Select(chars => chars
            .Where(c => !Char.IsWhiteSpace(c) && !Char.IsHighSurrogate(c))
            .LongCount())
        .Sum();
}
What happens above is that multiple threads are reading the file and processing the blocks, but reading the file is synchronized. Only one thread at a time is allowed to fetch the next block, by calling the IEnumerator<char[]>.MoveNext method. This behavior does not resemble a pure producer-consumer setup, where one thread would be dedicated to reading the file, but in practice the performance characteristics should be the same. That's because this particular workload has low variability. Parsing each character block should take approximately the same time. So when a thread is done with reading a block, another thread should be in the waiting list for reading the next block, resulting in the combined reading operation being almost continuous.
The Partitioner configured with NoBuffering is used so that each thread acquires one block at a time. Without it, PLINQ uses chunk partitioning, which means that each thread progressively asks for more and more elements at a time. Chunk partitioning is not suitable in this case, because the mere act of enumerating is costly.
The worker threads are provided by the ThreadPool. The current thread also participates in the processing. So in the above example, assuming that the current thread is the application's main thread, the number of threads provided by the ThreadPool is Environment.ProcessorCount - 1.
You may need to fine-tune the operation by adjusting the blockSize (larger is better) and the MaxDegreeOfParallelism to the capabilities of your hardware. Environment.ProcessorCount may be too many; 2 could well be enough.
The problem of counting the words is significantly more difficult, because a word may span more than one character block. It is even possible that the whole 1 GB file contains a single word. You may try to solve this problem by studying the source code of the StreamReader.ReadLine method, which has to deal with the same kind of problem. Tip: if one block ends with a non-whitespace character and the next block starts with a non-whitespace character as well, there is certainly a word split in half there. You could keep track of the number of split-in-half words and eventually subtract this number from the total number of words.
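For illustration, here is a minimal sequential sketch of that bookkeeping, reusing the ReadCharBlocks method above (a parallel version would additionally have to carry each block's boundary characters along with its partial counts; GetWordsCount is just a name for this sketch):
public static long GetWordsCount(string filePath)
{
    long words = 0;
    long splitWords = 0;
    char? lastCharOfPreviousBlock = null;

    foreach (var block in ReadCharBlocks(filePath, 10000))
    {
        // Count the words that start inside this block.
        bool inWord = false;
        foreach (var c in block)
        {
            if (!Char.IsWhiteSpace(c))
            {
                if (!inWord) { words++; inWord = true; }
            }
            else
            {
                inWord = false;
            }
        }

        // If the previous block ended mid-word and this block starts mid-word,
        // the same word has been counted twice.
        if (lastCharOfPreviousBlock.HasValue
            && !Char.IsWhiteSpace(lastCharOfPreviousBlock.Value)
            && block.Length > 0
            && !Char.IsWhiteSpace(block[0]))
        {
            splitWords++;
        }

        if (block.Length > 0) lastCharOfPreviousBlock = block[block.Length - 1];
    }

    return words - splitWords;
}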
This is a problem that doesn't need multithreading at all! Why? Because the CPU is far faster than the disk IO! So even in a single threaded application, the program will be waiting for data to be read from the disk. Using more threads will mean more waiting.
What you want is asynchronous file IO. So, a design like this:-
main
    asynchronously read a chunk of the file (one MB perhaps), calling the callback on completion
    while not at end of file
        wait for asynchronous read to complete
        process chunk of data
    end
end

asynchronous read completion callback
    flag data available to process
    asynchronously read next chunk of the file, calling the callback on completion
end
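For what it's worth, a rough C# sketch of that shape using ReadBlockAsync, so the next read is already in flight while the previous chunk is being processed (ProcessChunk is a placeholder for whatever counting you do; assumes System.IO and System.Threading.Tasks):
static async Task ProcessFileAsync(string path)
{
    const int ChunkSize = 1024 * 1024; // roughly 1 MB of characters
    var buffer = new char[ChunkSize];

    using (var reader = new StreamReader(path))
    {
        // Kick off the first read.
        Task<int> pendingRead = reader.ReadBlockAsync(buffer, 0, buffer.Length);
        while (true)
        {
            int count = await pendingRead;
            if (count == 0) break;

            // Copy the chunk so the buffer can be reused by the next read,
            // then start that read before doing the CPU work.
            var chunk = new char[count];
            Array.Copy(buffer, chunk, count);
            pendingRead = reader.ReadBlockAsync(buffer, 0, buffer.Length);

            ProcessChunk(chunk); // count words/characters here
        }
    }
}

static void ProcessChunk(char[] chunk)
{
    // Placeholder for the actual counting work.
}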
You may get the file's length at the beginning; let it be "S" (bytes).
Then, let's take some constant "C".
Execute C threads, and let each one of them process a text slice of length S/C.
You may read all of the file at once and load it into memory (if you have enough RAM for this), or you may let every thread read the relevant part of the file.
The first thread will process bytes 0 to S/C.
The second thread will process bytes S/C to 2S/C.
And so on.
After all threads have finished, sum up the counts.
How is that?
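A very rough sketch of that byte-range idea for the character count (it assumes ASCII-only text and ignores anything split across chunk boundaries; word counting would need the boundary handling described in the previous answer):
string fileName = "abc.txt";
long S = new FileInfo(fileName).Length;   // total size in bytes
int C = Environment.ProcessorCount;       // number of threads/slices
long[] counts = new long[C];              // one counter per slice, so no locking is needed

Parallel.For(0, C, t =>
{
    long start = t * (S / C);
    long length = (t == C - 1) ? S - start : S / C;  // last slice takes the remainder
    using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fs.Seek(start, SeekOrigin.Begin);
        var buffer = new byte[64 * 1024];
        long remaining = length, local = 0;
        while (remaining > 0)
        {
            int read = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (read == 0) break;
            for (int i = 0; i < read; i++)
                if (!char.IsWhiteSpace((char)buffer[i])) local++;
            remaining -= read;
        }
        counts[t] = local;
    }
});

long characters = counts.Sum(); // needs System.Linq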

Can I convert while(true) loop to EventWaitHandle?

I'm trying to process a large number of text files via Parallel.ForEach, adding the processed data to a BlockingCollection.
The problem is that I want the Task taskWriteMergedFile to consume the collection and write it to the result file at least every 800000 lines.
I guess I can't test the collection size within the iteration because it is parallelized, so I created the Task.
Can I convert the while(true) loop in the task to an EventWaitHandle in this case?
const int MAX_SIZE = 1000000;
static BlockingCollection<string> mergeData;
mergeData = new BlockingCollection<string>(new ConcurrentBag<string>(), MAX_SIZE);

string[] FilePaths = Directory.GetFiles("somepath");

var taskWriteMergedFile = new Task(() =>
{
    while (true)
    {
        if (mergeData.Count > 800000)
        {
            String.Join(System.Environment.NewLine, mergeData.GetConsumingEnumerable());
            //Write to file
        }
        Thread.Sleep(10000);
    }
}, TaskCreationOptions.LongRunning);

taskWriteMergedFile.Start();

Parallel.ForEach(FilePaths, FilePath => AddToDataPool(FilePath));
mergeData.CompleteAdding();
You probably don't want to do it that way. Instead, have your task write each line to the file as it's received. If you want to limit the file size to 800,000 lines, then after the 800,000th line is written, close the current file and open a new one.
Come to think of it, what you have can't work, because GetConsumingEnumerable() won't stop until the collection is marked as complete for adding. What would happen is that the thing would go through the sleep loop until there were 800,000 items in the queue, and then it would block on the String.Join until the main thread calls CompleteAdding. With enough data, you'd run out of memory.
Also, unless you have a very good reason, you shouldn't use ConcurrentBag here. Just use the default for BlockingCollection, which is ConcurrentQueue. ConcurrentBag is a rather special purpose data structure that won't perform as well as ConcurrentQueue.
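For example, just drop the bag and keep the bounded capacity:
mergeData = new BlockingCollection<string>(MAX_SIZE); // backed by a ConcurrentQueue<string> by default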
So your task becomes:
var taskWriteMergedFile = new Task(() =>
{
    int recordCount = 0;
    foreach (var line in mergeData.GetConsumingEnumerable())
    {
        outputFile.WriteLine(line);
        ++recordCount;
        if (recordCount == 800000)
        {
            // If you want to do something after 800,000 lines, do it here
            // and then reset the record count
            recordCount = 0;
        }
    }
}, TaskCreationOptions.LongRunning);
That assumes, of course, that you've opened the output file somewhere else. It's probably better to open the output at the start of the task, and close it after the foreach has exited.
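For instance, something along these lines (sketch only; the file name is a placeholder):
var taskWriteMergedFile = Task.Factory.StartNew(() =>
{
    // Opening the writer inside the task ties its lifetime to the loop.
    using (var outputFile = new StreamWriter("merged.txt"))
    {
        foreach (var line in mergeData.GetConsumingEnumerable())
        {
            outputFile.WriteLine(line);
        }
    }
}, TaskCreationOptions.LongRunning);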
On another note, you probably don't want your producer loop to be parallel. You have:
Parallel.ForEach(FilePaths, FilePath => AddToDataPool(FilePath));
I don't know for sure what AddToDataPool is doing, but if it's reading a file and writing the data to the collection, you have a couple of problems. First, the disk drive can only do one thing at a time, so it ends up reading part of one file, then part of another, then part of another, etc. In order to read each chunk of the next file, it has to seek the head to the proper position. Disk head seeks are incredibly expensive--5 milliseconds or more. An eternity in CPU time. Unless you're doing heavy duty processing that takes much longer than reading the file, you're almost always better off processing one file at a time. Unless you can guarantee that the input files are on separate physical disks . . .
The second potential problem is that with multiple threads running, you can't guarantee the order in which things are written to the collection. That might not be a problem, of course, but if you expect all of the data from a single file to be grouped together in the output, that's not going to happen with multiple threads each writing multiple lines to the collection.
Just something to keep in mind.

C# Parallel.foreach - Making variables thread safe

I have been rewriting some process-intensive looping to use TPL to increase speed. This is the first time I have tried threading, so I want to check that what I am doing is the correct way to do it.
The results are good - processing the data from 1000 Rows in a DataTable has reduced processing time from 34 minutes to 9 minutes when moving from a standard foreach loop into a Parallel.ForEach loop. For this test, I removed non thread safe operations, such as writing data to a log file and incrementing a counter.
I still need to write back into a log file and increment a counter, so I tried implementing a lock which encloses the streamwriter/increment code block.
FileStream filestream = new FileStream("path_to_file.txt", FileMode.Create);
StreamWriter streamwriter = new StreamWriter(filestream);
streamwriter.AutoFlush = true;

try
{
    object locker = new object();

    // Lets assume we have a DataTable containing 1000 rows of data.
    DataTable datatable_results;

    if (datatable_results.Rows.Count > 0)
    {
        int row_counter = 0;

        Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
        {
            // Process data_row as normal.
            // When ready to write to log, do so.
            lock (locker)
            {
                row_counter++;
                streamwriter.WriteLine("Processing row: {0}", row_counter);
                // Write any data we want to log.
            }
        });
    }
}
catch (Exception e)
{
    // Catch the exception.
}

streamwriter.Close();
The above seems to work as expected, with minimal performance cost (still 9 minutes execution time). Granted, the actions contained in the lock are hardly significant in themselves; I assume that as the time taken to run the code within the lock increases, the longer each thread is locked for and the more it affects overall processing time.
My question: is the above an efficient way of doing this or is there a different way of achieving the above that is either faster or safer?
Also, let's say our original DataTable actually contains 30,000 rows. Is there anything to be gained by splitting this DataTable into chunks of 1,000 rows each and then processing them in the Parallel.ForEach, instead of processing all 30,000 rows in one go?
Writing to the file is expensive, and you're holding an exclusive lock while writing to it; that's bad. It's going to introduce contention.
You could add the lines to a buffer instead, then write to the file all at once. That should remove the contention and provide a way to scale.
if (datatable_results.Rows.Count > 0)
{
    ConcurrentQueue<string> buffer = new ConcurrentQueue<string>();

    Parallel.ForEach(datatable_results.AsEnumerable(), (data_row, state, index) =>
    {
        // Process data_row as normal.
        // When ready to write to log, do so.
        buffer.Enqueue(string.Format("Processing row: {0}", index));
    });

    streamwriter.AutoFlush = false;
    string line;
    while (buffer.TryDequeue(out line))
    {
        streamwriter.WriteLine(line);
    }
    streamwriter.Flush(); // Flush once when needed
}
Note that you don't need to maintain a loop counter; Parallel.ForEach provides you one. The difference is that it is not a counter but an index. If that changes the expected behavior for you, you can still add the counter back and use Interlocked.Increment to increment it.
I see that you're using streamwriter.AutoFlush = true; that will hurt performance. You can set it to false and flush once you're done writing all the data.
If possible, wrap the StreamWriter in a using statement, so that you don't even need to flush the stream (you get it for free).
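For instance, a rough sketch of that shape (the file name is a placeholder, and buffer is the ConcurrentQueue from the snippet above):
using (var streamwriter = new StreamWriter("path_to_file.txt")) // AutoFlush is false by default
{
    // ... run the Parallel.ForEach that fills 'buffer' here ...

    string line;
    while (buffer.TryDequeue(out line))
    {
        streamwriter.WriteLine(line);
    }
} // Dispose flushes and closes the stream for you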
Alternatively, you could look at logging frameworks, which do this job pretty well. Examples: NLog, log4net, etc.
You may try to improve on this if you avoid logging, or log only into a thread-specific log file (not sure if that makes sense for you).
The TPL starts as many threads as you have cores (see: Does Parallel.ForEach limit the number of active threads?).
So what you can do is:
1) Get the number of cores on the target machine
2) Create a list of counters, with as many elements as there are cores
3) Have each thread update its own counter
4) Sum them all up after the parallel execution terminates.
So, in practice:
// KEY: THREAD ID, VALUE: THREAD-LOCAL COUNTER
// (assumes the dictionary has been pre-populated with one entry per worker thread,
//  since Dictionary is not safe for concurrent inserts)
Dictionary<int, int> counters = new Dictionary<int, int>(NUMBER_OF_CORES);
....
Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
{
    // Process data_row as normal.
    // When ready to write to log, do so.
    //lock (locker) // NO NEED FOR A LOCK, EVERY THREAD UPDATES ITS _OWN_ COUNTER
    //{
    //    row_counter++;
    counters[Thread.CurrentThread.ManagedThreadId] += 1;
    // NO WRITING, OR WRITE TO A THREAD-SPECIFIC FILE ONLY
    //streamwriter.WriteLine("Processing row: {0}", row_counter);
    //}
});
....
// AFTER EXECUTION OF THE PARALLEL LOOP, SUM ALL COUNTERS TO GET THE TOTAL OF ALL THREADS.
The benefit of this is that no locking is involved at all, which will dramatically improve performance. When you use the .NET concurrent collections, they always use some kind of locking inside.
This is naturally a basic idea and may not work as expected if you copy-paste it. We are talking about multithreading, which is always a hard topic. But, hopefully, it gives you some ideas to rely on.
First of all, it takes about 2 seconds to process a row in your table and perhaps a few milliseconds to increment the counter and write to the log file. With the actual processing being 1000x more than the part you need to serialize, the method doesn't matter too much.
Furthermore, the way you have implemented it is perfectly solid. There are ways to optimize it, but none that are worth implementing in your situation.
One useful way to avoid locking on the increment is to use Interlocked.Increment. It is a bit slower than x++ but much faster than lock {x++;}. In your case, though, it doesn't matter.
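For example:
// Instead of: lock (locker) { row_counter++; }
Interlocked.Increment(ref row_counter);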
As for the file output, remember that the output is going to be serialized anyway, so at best you can minimize the amount of time spent in the lock. You can do this by buffering all of your output before entering the lock, then just perform the write operation inside the lock. You probably want to do async writes to avoid unnecessary blocking on I/O.
You can move part of the code from the parallel body into a new method. For example:
// Class scope
private string GetLogRecord(int rowCounter, DataRow row)
{
    return string.Format("Processing row: {0}", rowCounter); // Write any data we want to log.
}

//....

Parallel.ForEach(datatable_results.AsEnumerable(), data_row =>
{
    // Process data_row as normal.
    // When ready to write to log, do so.
    int counter;
    lock (locker)
        counter = ++row_counter;                     // capture the new value while still holding the lock
    var logRecord = GetLogRecord(counter, data_row); // build the string outside the lock
    lock (locker)
        streamwriter.WriteLine(logRecord);
});
This is my code that uses a parallel for. The concept is similar, and perhaps easier for you to implement. FYI, for debugging I keep a regular for loop in the code and conditionally compile the parallel code. Hope this helps. The value of i in this scenario isn't the same as the number of records processed, however. You could create a counter, use a lock, and add values to it. For my other code where I do have a counter, I didn't use a lock and just allowed the value to be potentially off, to avoid the slower code. I have a status mechanism to indicate the number of records processed. For my implementation, the slight chance that the count is off is not an issue; at the end of the loop I put out a message saying all the records have been processed.
#if DEBUG
    for (int i = 0; i < stend.PBBIBuckets.Count; i++)
    {
        //int serverIndex = 0;
#else
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = m_maxThreads;

    Parallel.For(0, stend.PBBIBuckets.Count, options, (i) =>
    {
#endif
        g1client.Message request;
        DataTable requestTable;

        request = new g1client.Message();
        requestTable = request.GetDataTable();

        requestTable.Columns.AddRange(
            Locations.Columns.Cast<DataColumn>().Select(x => new DataColumn(x.ColumnName, x.DataType)).ToArray());

        FillPBBIRequestTables(requestTable, request, stend.PBBIBuckets[i], stend.BucketLen[i], stend.Hierarchies);
#if DEBUG
    }
#else
    });
#endif

Log to memory and then write to file

I want to know something. I have a loop that runs for all days (7 times) and then another loop inside it that runs for all the records in the file (about 100,000), so all in all it runs about 700,000 times. Now I want to log each step of processing in each loop to a file: say we are in the first iteration of the outer loop and the first iteration of the inner loop, we log what was done each time. But the problem is that if I were to log each time, it would terribly hurt the performance because of so many IO operations. What I was thinking is: is there any way I could log each and every step to memory (a memory stream or anything) and then, at the end of the outer loop, write all that memory stream data to the file?
say if I have
for (int i = 0; i < 7; i++)
{
    for (int j = 0; j < RecordsCount; j++)
    {
        SomeOperation();
        // I am logging to a file here right now
    }
}
It's hurting the performance terribly; if I remove the logging to a file, I can finish a lot earlier. So I was thinking it would be better to log to memory first and write the whole log from memory to the file afterwards. So, is there any way to do this? If there are many, what's the best?
ps: here is a logger I found http://www.codeproject.com/Articles/23424/TracerX-Logger-and-Viewer-for-NET, but I can use any other custom logger or anything else if required.
EDIT: If I use a memory stream and then write it all to the file, would that give me better performance than using File.AppendAllLines as suggested in the answers by Yorye and zmbq? And would it give me any performance gain over what Jeremy commented?
This is just an example, but you can get the idea.
You really had the solution, you just needed to do it...
for (int i = 0; i < 7; i++)
{
    var entries = new List<string>();

    for (int j = 0; j < RecordsCount; j++)
    {
        SomeOperation();

        // Log into list
        entries.Add("Operation #" + j + " results: bla bla bla");
    }

    // Log all entries to file at once.
    File.AppendAllLines("logFile.txt", entries);
}
Why not use a proper logging framework instead of writing your own?
NLog for instance has buffering built in and it is easy to configure: https://github.com/nlog/NLog/wiki/BufferingWrapper-target
I suggest you focus on writing code that gives value to your project while reusing existing solutions for all the other stuff. That will probably make you more efficient and give better results :-)
Logging 700,000 lines to a file shouldn't take long at all, as long as you use adequate buffers. In fact, it shouldn't take longer doing it inside the loop, compared to doing it all at once outside the loop.
Don't use File.AppendAllLines or something similar; instead open a stream to the file, make sure you have a buffer in place, and write through it.
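A small sketch of that, reusing RecordsCount and SomeOperation() from the question (the file name and buffer size are arbitrary, and Encoding comes from System.Text):
// One buffered writer for the whole run; writes go through the buffer,
// so the per-line cost stays small.
using (var writer = new StreamWriter("logFile.txt", true, Encoding.UTF8, 1 << 16))
{
    for (int i = 0; i < 7; i++)
    {
        for (int j = 0; j < RecordsCount; j++)
        {
            SomeOperation();
            writer.WriteLine("Day {0}, record {1} done", i, j);
        }
    }
}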

How to increase the speed of my execution

I am creating a project in C#.NET. My execution process is very slow, and I also found the reason for that: in one method I copy the values from one list to another. That list contains more than 3000 values for every row. How can I speed up this process? Can anybody help me?
for (int i = 0; i < rectTristrip.NofStrips; i++)
{
    VertexList verList = new VertexList();
    verList = rectTristrip.Strip[i];
    GraphicsPath rectPath4 = verList.TristripToGraphicsPath();
    for (int j = 0; j < rectPath4.PointCount; j++)
    {
        pointList.Add(rectPath4.PathPoints[j]);
    }
}
This is the code that slows down my process. rectTristrip consists of a lot of vertices, and each vertex list has more than 3000 values.
A profiler will tell you exactly how much time is spent on which lines and which are most important to optimize. Red-gate makes a very good one.
http://www.red-gate.com/products/ants_performance_profiler/index.htm
Like musicfreak already mentioned, you should profile your code to get a reliable picture of what's going on. But some processes simply take time.
In some ways you can't get rid of them; they must be done. The question is just: when are they necessary? Maybe you can move them into some initialization phase, or into another thread which computes the results for you while your GUI stays accessible to your users.
In one of my applications I make a big query against a SQL Server. This task takes a while (building up the connection, sending the query, waiting for the result, putting the result into a data table, making some calculations of my own, presenting the results to the user). All of these steps are necessary and can't be made any faster. But they are done in another thread while the user sees a 'Please wait' with a progress bar in the result window. In the meantime the user can already change some other settings in the UI (if he likes). So the UI stays responsive and the user has no big problem waiting a few seconds.
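Something along these lines, for example (a sketch assuming a WinForms-style UI; RunBigQuery, progressBar and resultsGrid are placeholder names):
// Kick the long-running query off the UI thread, then marshal the result back.
progressBar.Visible = true;
Task.Factory.StartNew(() => RunBigQuery())               // runs on a worker thread
    .ContinueWith(t =>
    {
        progressBar.Visible = false;
        resultsGrid.DataSource = t.Result;                // back on the UI thread
    }, TaskScheduler.FromCurrentSynchronizationContext());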
So this is not a real answer, but maybe it gives you some ideas on how to solve your problem.
You can split the load into a couple of worker threads, say 3 threads each dealing with 1000 elements.
You can synchronize them with AutoResetEvent.
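A rough sketch of that suggestion (the worker count and chunk size come from the sentence above; allPoints and ProcessSlice are placeholders for the real data and work):
int workers = 3, chunk = 1000;
var doneSignals = new AutoResetEvent[workers];

for (int w = 0; w < workers; w++)
{
    int start = w * chunk;                                // this worker's slice
    var signal = doneSignals[w] = new AutoResetEvent(false);
    new Thread(() =>
    {
        ProcessSlice(allPoints, start, chunk);            // do this worker's share of the copying
        signal.Set();                                     // tell the main thread this worker is done
    }).Start();
}

WaitHandle.WaitAll(doneSignals);                          // wait until all three workers have signalled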
Some suggestions, even though I think the bulk of the work is in TristripToGraphicsPath():
// Use rectTristrip.Strip.Length instead of NofStrips
// to let the JIT eliminate bounds checking
// (.Count if it is a list instead of an array)
for (int i = 0; i < rectTristrip.Strip.Length; i++)
{
    VertexList verList = rectTristrip.Strip[i]; // Removed 'new'
    GraphicsPath rectPath4 = verList.TristripToGraphicsPath();

    // Assuming pointList is in fact a list, do this:
    pointList.AddRange(rectPath4.PathPoints);

    // Else do this instead:
    // (use PathPoints.Length instead of PointCount
    // to let the JIT eliminate bounds checking)
    //for (int j = 0; j < rectPath4.PathPoints.Length; j++)
    //{
    //    pointList.Add(rectPath4.PathPoints[j]);
    //}
}
And maybe just assign verList = rectTristrip.Strip[i]; inside the loop, defining the variable VertexList verList above the loop, to save some memory.
