C# multithreading file reading and page parsing

I have a file with more than 500,000 URLs. Now I want to read the file and parse every URL with my function, which returns a string message. For now everything is working fine, but the performance is not good, so I need to start the parsing in simultaneous threads (for example, 100 threads):
ParserEngine parseEngine = new ParserEngine(parseFormulas);
StreamReader reader = new StreamReader("urls.txt");
string line = string.Empty;
while ((line = reader.ReadLine()) != null)
{
    string result = parseEngine.Parse(line);
    Console.WriteLine(result);
}
reader.Close();
It would be good if I could stop all the threads by clicking a button, and change the number of threads. Any help and tips?

Be sure to check out this article on PLINQ performance compared to other techniques for parsing a text file, line by line, using multithreading.
Not only does it provide sample source code for doing something almost identical to what you want, but the authors also discovered a "gotcha" with PLINQ that can result in abnormally slow times. In a nutshell, if you try to use File.ReadLines() or StreamReader.ReadLine() you'll spoil the performance, because PLINQ can't properly divide the file up that way. They solved the problem by reading all the lines into an indexed array first, and THEN processing it with PLINQ.
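For illustration, a rough sketch of that approach might look like this, reusing parseEngine from the question; the file name and degree of parallelism are assumptions:
using System;
using System.IO;
using System.Linq;

string[] allLines = File.ReadAllLines("urls.txt"); // indexed array partitions well

allLines
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .ForAll(line => Console.WriteLine(parseEngine.Parse(line)));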

Honestly, for the performance difference I would just try Parallel.ForEach in .NET 4.0 if that is an option.
using System.Threading.Tasks;

Parallel.ForEach(enumerableList, p =>
{
    parseEngine.Parse(p);
});
It's a decent start to running things in parallel and should minimize your thread troubleshooting headaches.

A producer/consumer setup would be good for this. One thread reading from the file and writing to a Queue, and the other threads can read from the queue.
You mentioned an example of 100 threads. If you had this many threads, you would want to read from the Queue in batches, since you'd probably have to lock the Queue before reading, as a Queue is only thread-safe for a single reader and writer.
I think there is a new ConcurrentQueue generic in 4.0, but I can't remember for sure.
You really only want one reader to the file.
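For illustration, a minimal sketch of that single-reader setup using BlockingCollection (which wraps a ConcurrentQueue in .NET 4.0) might look like this; parseEngine comes from the original question and the worker count is arbitrary:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

var urls = new BlockingCollection<string>(boundedCapacity: 1000);

// Producer: the only thread that touches the file.
var producer = Task.Run(() =>
{
    foreach (string line in File.ReadLines("urls.txt"))
        urls.Add(line);
    urls.CompleteAdding();
});

// Consumers: a handful of workers draining the queue in parallel.
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (string url in urls.GetConsumingEnumerable())
        Console.WriteLine(parseEngine.Parse(url));
})).ToArray();

Task.WaitAll(consumers);
producer.Wait();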

You could use Parallel.ForEach() to schedule a thread for each item in the list. That would spread the threads out among all available processors, assuming that parseEngine takes some time to run. If parseEngine runs pretty quickly (defined as less than 250ms), increase the number of "on-demand" threads by calling ThreadPool.SetMinThreads(), which will result in more threads executing at once.
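As a rough sketch of both suggestions combined (parseEngine and the urls file come from the original question; the minimum-thread figure is arbitrary):
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Raise the worker-thread floor so the pool ramps up faster for short work items.
ThreadPool.GetMinThreads(out int workerThreads, out int ioThreads);
ThreadPool.SetMinThreads(32, ioThreads);

Parallel.ForEach(File.ReadLines("urls.txt"), line =>
{
    Console.WriteLine(parseEngine.Parse(line));
});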

Related

How to optimize the counting of words and characters in a huge file using multithreading?

I have a very large text file around 1 GB.
I need to count the number of words and characters (non-space characters).
I have written the below code.
string fileName = "abc.txt";
long words = 0;
long characters = 0;
if (File.Exists(fileName))
{
    using (StreamReader sr = new StreamReader(fileName))
    {
        string text = sr.ReadToEnd();
        string[] fields = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string str in fields)
        {
            characters += str.Length;
        }
        words += fields.LongLength;
    }
    Console.WriteLine("The word count is {0} and character count is {1}", words, characters);
}
Is there any way to make it faster using threads? Someone suggested using threads so that it will be faster.
I have found one issue in my code: it will fail if the number of words or characters is greater than the long max value.
I have written this code assuming that there will be only English characters, but there can be non-English characters as well.
I am especially looking for the thread related suggestions.
Here is how you could tackle the problem of counting the non-whitespace characters of a huge text file efficiently, using parallelism. First we need a way to read blocks of characters in a streaming fashion. The native File.ReadLines method doesn't cut it, since the file may consist of a single line. Below is a method that uses the StreamReader.ReadBlock method to grab blocks of characters of a specific size, and return them as an IEnumerable<char[]>.
public static IEnumerable<char[]> ReadCharBlocks(String path, int blockSize)
{
    using (var reader = new StreamReader(path))
    {
        while (true)
        {
            var block = new char[blockSize];
            var count = reader.ReadBlock(block, 0, block.Length);
            if (count == 0) break;
            if (count < block.Length) Array.Resize(ref block, count);
            yield return block;
        }
    }
}
With this method in place, it is then quite easy to parallelize the parsing of the character blocks using PLINQ:
public static long GetNonWhiteSpaceCharsCount(string filePath)
{
    return Partitioner
        .Create(ReadCharBlocks(filePath, 10000), EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .Select(chars => chars
            .Where(c => !Char.IsWhiteSpace(c) && !Char.IsHighSurrogate(c))
            .LongCount())
        .Sum();
}
What happens above is that multiple threads are reading the file and processing the blocks, but reading the file is synchronized. Only one thread at a time is allowed to fetch the next block, by calling the IEnumerator<char[]>.MoveNext method. This behavior does not resemble a pure producer-consumer setup, where one thread would be dedicated to reading the file, but in practice the performance characteristics should be the same. That's because this particular workload has low variability. Parsing each character block should take approximately the same time. So when a thread is done reading a block, another thread should be in the waiting list for reading the next block, resulting in the combined reading operation being almost continuous.
The Partitioner configured with NoBuffering is used so that each thread acquires one block at a time. Without it, PLINQ utilizes chunk partitioning, which means that progressively each thread asks for more and more elements at a time. Chunk partitioning is not suitable in this case, because the mere act of enumerating is costly.
The worker threads are provided by the ThreadPool. The current thread also participates in the processing. So in the above example, assuming that the current thread is the application's main thread, the number of threads provided by the ThreadPool is Environment.ProcessorCount - 1.
You may need to fine-tune the operation by adjusting the blockSize (larger is better) and the degree of parallelism to the capabilities of your hardware. The Environment.ProcessorCount may be too many; 2 could probably be enough.
The problem of counting the words is significantly more difficult, because a word may span more than one character block. It is even possible that the whole 1 GB file contains a single word. You may try to solve this problem by studying the source code of the StreamReader.ReadLine method, which has to deal with the same kind of problem. Tip: if one block ends with a non-whitespace character, and the next block starts with a non-whitespace character as well, there is certainly a word split in half there. You could keep track of the number of split-in-half words, and eventually subtract this number from the total number of words.
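For what it's worth, here is a minimal sequential sketch of that boundary-tracking idea, reusing the ReadCharBlocks method above; parallelizing it would additionally require preserving the order of the blocks:
public static long CountWords(string filePath, int blockSize = 10000)
{
    long words = 0;
    bool previousEndedInWord = false;
    foreach (char[] block in ReadCharBlocks(filePath, blockSize))
    {
        bool inWord = false;
        foreach (char c in block)
        {
            if (!Char.IsWhiteSpace(c))
            {
                if (!inWord) { words++; inWord = true; }
            }
            else inWord = false;
        }
        // A word split across the block boundary was counted twice; subtract it.
        if (previousEndedInWord && block.Length > 0 && !Char.IsWhiteSpace(block[0]))
            words--;
        previousEndedInWord = block.Length > 0 && !Char.IsWhiteSpace(block[block.Length - 1]);
    }
    return words;
}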
This is a problem that doesn't need multithreading at all! Why? Because the CPU is far faster than the disk IO! So even in a single threaded application, the program will be waiting for data to be read from the disk. Using more threads will mean more waiting.
What you want is asynchronous file IO. So, a design like this:-
main
    asynchronously read a chunk of the file (one MB perhaps), calling the callback on completion
    while not at end of file
        wait for asynchronous read to complete
        process chunk of data
    end
end

asynchronous read completion callback
    flag data available to process
    asynchronously read next chunk of the file, calling the callback on completion
end
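A rough C# translation of that design might look like the sketch below; ProcessChunk is a hypothetical placeholder for the counting work, and the double buffering keeps one read in flight while the previous chunk is processed:
using System.IO;
using System.Threading.Tasks;

async Task ReadAndProcessAsync(string path)
{
    const int chunkSize = 1 << 20; // 1 MB
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
        FileShare.Read, bufferSize: chunkSize, useAsync: true))
    {
        byte[] current = new byte[chunkSize];
        byte[] next = new byte[chunkSize];

        int bytesRead = await stream.ReadAsync(current, 0, chunkSize);
        while (bytesRead > 0)
        {
            // Kick off the next read while the current chunk is processed.
            Task<int> nextRead = stream.ReadAsync(next, 0, chunkSize);
            ProcessChunk(current, bytesRead);  // CPU-bound work happens here
            bytesRead = await nextRead;
            (current, next) = (next, current); // swap buffers
        }
    }
}

void ProcessChunk(byte[] buffer, int count) { /* count words/characters here */ }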
You can get the file's length at the beginning; let it be "S" (bytes).
Then, let's take some constant "C".
Execute C threads, and let each one of them process a text segment of length S/C.
You may read all of the file at once and load it into memory (if you have enough RAM for this), or you may let every thread read the relevant part of the file.
The first thread will process bytes 0 to S/C.
The second thread will process bytes S/C to 2S/C.
And so on.
After all threads have finished, summarize the counts.
How is that?

Is it possible to have any dataflow block type send multiple intermediate results as a result of a single input?

Is it possible to get TransformManyBlock to send intermediate results as they are created to the next step, instead of waiting for the entire IEnumerable<T> to be filled?
All testing I've done shows that TransformManyBlock only sends a result to the next block when it is finished; the next block then reads those items one at a time.
It seems like basic functionality but I can't find any examples of this anywhere.
The use case is processing chunks of a file as they are read. In my case there's a modulus of so many lines needed before I can process anything so a direct stream won't work.
The kludge I've come up with is to create two pipelines:
a "processing" dataflow network that processes the chunks of data as they become available
a "producer" dataflow network that ends where the file is broken into
chunks, which are then posted to the start of the "processing" network that actually transforms the data.
The "producer" network needs to be seeded with the starting point of the "processing" network.
Not a good long-term solution, since additional processing options will be needed and it's not flexible.
Is it possible to have any dataflow block type send multiple intermediate results as they are created from a single input? Any pointers to working code?
You probably need to create your IEnumerables by using an iterator. This way an item will be propagated downstream after every yield return statement. The only problem is that yielding from lambda functions is not supported in C#, so you'll have to use a local function instead. Example:
var block = new TransformManyBlock<string, string>(filePath => ReadLines(filePath));

IEnumerable<string> ReadLines(string filePath)
{
    string[] lines = File.ReadAllLines(filePath);
    foreach (var line in lines)
    {
        yield return line; // Immediately offered to any linked block
    }
}
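As a small usage sketch (assuming the block above), linking it to an ActionBlock shows each line arriving downstream as soon as it is yielded; the file path is hypothetical:
using System;
using System.Threading.Tasks.Dataflow;

var printer = new ActionBlock<string>(line => Console.WriteLine(line));
block.LinkTo(printer, new DataflowLinkOptions { PropagateCompletion = true });

block.Post("input.txt"); // hypothetical file path
block.Complete();
printer.Completion.Wait();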

How to make these IO reads parallel and performant

I have a list of files: List<string> Files in my C#-based WPF application.
Files contains ~1,000,000 unique file paths.
I ran a profiler on my application. When I try to do parallel operations, it's REALLY laggy because it's IO bound. It even lags my UI threads, despite not having dispatchers going to them (note the two lines I've marked below):
Files.AsParallel().ForAll(x =>
{
    char[] buffer = new char[0x100000];
    using (FileStream stream = new FileStream(x, FileMode.Open, FileAccess.Read)) // EXTREMELY SLOW
    using (StreamReader reader = new StreamReader(stream, true))
    {
        while (true)
        {
            int charsRead = reader.Read(buffer, 0, buffer.Length); // EXTREMELY SLOW
            if (charsRead <= 0)
            {
                break;
            }
        }
    }
});
These two lines of code take up ~70% of my entire profile test runs. I want to achieve maximum parallelization for IO, while keeping performance such that it doesn't cripple my app's UI entirely. There is nothing else affecting my performance. Proof: Using Files.ForEach doesn't cripple my UI, and WithDegreeOfParallelism helps out too (but, I am writing an application that is supposed to be used on any PC, so I cannot assume a specific degree of parallelism for this computation); also, the PC I am on has a solid-state hard disk. I have searched on StackOverflow, and have found links that talk about using asynchronous IO read methods. I'm not sure how they apply in this case, though. Perhaps someone can shed some light? Also; how can you tune down the constructor time of a new FileStream; is that even possible?
Edit: Well, here's something strange that I've noticed... the UI doesn't get crushed so badly when I swap Read for ReadAsync while still using AsParallel. Simply awaiting the task created by ReadAsync causes my UI thread to maintain some degree of usability. I think some sort of asynchronous scheduling is done in this method to maintain optimal disk usage while not crushing existing threads. And on that note, is there ever a chance that the operating system contends for existing threads to do IO, such as my application's UI thread? I seriously don't understand why it's slowing my UI thread. Is the OS scheduling work from IO on my thread or something? Did they do something to the CLR to eat threads that haven't been explicitly affinitized using Thread.BeginThreadAffinity or something? Memory is not an issue; I am looking at Task Manager and there is plenty.
I don't agree with your assertion that you can't use WithDegreeOfParallelism because the app will be used on any PC. You can base it on the number of CPUs. By not using WithDegreeOfParallelism you are going to get crushed on some PCs.
You optimized for a solid-state disk where heads don't have to move. I don't think this unrestricted parallel design will hold up on a regular disk (any PC).
I would try a BlockingCollection with 3 queues: FileStream, StreamReader, and ObservableCollection. Limit the FileStream queue to about 4 - it just has to stay ahead of the StreamReader. And no parallelism.
A single head is a single head. It cannot read from 5 or 5,000 files faster than it can read from 1. On solid state there is no penalty for switching from file to file - on a regular disk there is a significant penalty. If your files are fragmented there is a significant penalty (on a regular disk).
You don't show what you do with the data you read, but the next step would be to put the writes in another queue backed by a BlockingCollection.
E.g. put sb.Append(text); in a separate queue.
But that may be more overhead than it is worth.
Keeping that head as close to 100% busy on a single contiguous file is the best you are going to do.
private async Task<string> ReadTextAsync(string filePath)
{
    using (FileStream sourceStream = new FileStream(filePath,
        FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 4096, useAsync: true))
    {
        StringBuilder sb = new StringBuilder();
        byte[] buffer = new byte[0x1000];
        int numRead;
        while ((numRead = await sourceStream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            string text = Encoding.Unicode.GetString(buffer, 0, numRead);
            sb.Append(text);
        }
        return sb.ToString();
    }
}
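For the queueing side of the suggestion above, a minimal sketch of a bounded single-reader pipeline might look like this (files and ProcessText are hypothetical placeholders):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var queue = new BlockingCollection<string>(boundedCapacity: 4); // reader stays just ahead

// Single producer: read the files sequentially so the disk head isn't thrashing.
var producer = Task.Run(() =>
{
    foreach (string path in files)        // hypothetical list of file paths
        queue.Add(File.ReadAllText(path));
    queue.CompleteAdding();
});

// Single consumer: do the CPU-bound work off the UI thread, no parallelism.
var consumer = Task.Run(() =>
{
    foreach (string text in queue.GetConsumingEnumerable())
        ProcessText(text);                // hypothetical processing step
});

Task.WaitAll(producer, consumer);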
File access is inherently not parallel. You can only benefit from parallelism if you process some files while reading others. It makes no sense to wait for the disk in parallel.
Instead of waiting 100,000 times for 1 ms of disk access, your program ends up waiting once for 100,000 ms = 100 s.
Unfortunately, it's a vague question without a reproducible code example. So it's impossible to offer specific advice. But my two recommendations are:
Pass a ParallelOptions instance where you've set the MaxDegreeOfParallelism property to something reasonably low. Something like the number of cores in your system, or even that number minus one.
Make sure you aren't expecting too much from the disk. You should start with the known speed of the disk and controller, and compare that with the data throughput you're getting. Adjust the degree of parallelism even lower if it looks like you're already at or near the maximum theoretical throughput.
Performance optimization is all about setting realistic goals based on known limitations of the hardware, measuring your actual performance, and then looking at how you can improve the costliest elements of your algorithm. If you haven't done the first two steps yet, you really should start there. :)
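For example, a capped Parallel.ForEach over the Files list from the question might look like the sketch below, where ReadOneFile is a hypothetical stand-in for the read loop:
using System;
using System.Threading.Tasks;

var options = new ParallelOptions
{
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
};

Parallel.ForEach(Files, options, path => ReadOneFile(path));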
I got it working; the problem was me trying to use an ExtendedObservableCollection with AddRange instead of calling Add multiple times in every UI dispatch... for some reason, the performance of the methods people list here is actually slower in my situation: ObservableCollection Doesn't support AddRange method, so I get notified for each item added, besides what about INotifyCollectionChanging?
I think it's because it forces you to raise change notifications with .Reset (reload) instead of .Add (a diff); there is some sort of logic in place that causes bottlenecks.
I apologize for not posting the rest of the code; I was really thrown off by this, and I'll explain why in a moment. Also, a note for others who come across the same issue, this might help: the main problem with profiling tools in this scenario is that they don't help much here. Most of your app's time will be spent reading files regardless. So you have to unit test all dispatchers separately.

C# read text file lines multi thread

I want to write a fast multi-threaded program in C# that reads a file.
The file must be split into several parts, and each part processed in a different thread. For example:
Line1
Line2
Line3
Line4
must be split into 4 lines like this:
Line1 => thread 1
Line2 => thread 2
Line3 => thread 3
Line4 => thread 4
I used StreamReader.ReadLine(), but it can't read a specific line.
Comment: it's necessary to speed up the program, so I want to read the file in separate threads.
Unless you're using fixed-length lines, this isn't possible.
Why? Because in order to determine where the "lines" split, you need to find the newline characters... which means you need to read the file first.
Now, if you simply want to perform some extra "processing" after you read in each line - that is possible and relatively straight-forward using a ThreadPool.
You should read the file in a single thread - but then spawn the processing of each line to a different thread, e.g. by adding it to a producer/consumer queue.
Even if you could seek to a specific line in a text file (which in general you can't) you really don't want the disk thrashing around - that'll only slow things down. The fastest way to get the data off the disk is to read it sequentially. By all means defer everything about handling the line beyond "decoding the binary data to text" to other threads, but you really don't want the IO to be in multiple threads.
AFAIK .NET doesn't support parallel stream reading. If you want to process every line, you may use File.ReadAllLines. It returns an array of strings. Then you can use PLINQ.
var result = File.ReadAllLines("path")
    .AsParallel()
    .Select(s => DoSthWithString(s))
    .ToList();
You're not going to be able to speed up the actual reading because you're going to have tremendous locking issues keeping everything straight.
Since a text file is an unstructured file, i.e. each line can be a different length, you have no choice but to read each line after the other, one by one.
Now, what you can do is process those lines on different threads, but the actual reading, keep that to one thread.
But, before you do that, are you sure you even have to do this? Is this a bottleneck? If not, fix the bottleneck first and see how far you get.
Your StreamReader is connected to a stream class. Using the stream class you can .Seek to a particular byte location.
Like others have said, this probably isn't a good idea, but it can be done.
I would split the file beforehand. Say the file is 1,000 lines: split it into 10 files of 100 lines each, and have a thread process each file.
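A small sketch of that pre-splitting step might look like this (file names and chunk size are arbitrary):
using System.IO;
using System.Linq;

string[] lines = File.ReadAllLines("input.txt");
const int linesPerPart = 100;

for (int i = 0; i * linesPerPart < lines.Length; i++)
{
    var part = lines.Skip(i * linesPerPart).Take(linesPerPart);
    File.WriteAllLines("part" + i + ".txt", part);
}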

Parallel programming in C#

I'm interested in learning about parallel programming in C#.NET (not everything there is to know, but the basics and maybe some good practices), therefore I've decided to reprogram an old program of mine called ImageSyncer. ImageSyncer is a really simple program; all it does is scan through a folder and find all files ending with .jpg, then it calculates the new position of the files based on the date they were taken (parsing of EXIF data, or whatever it's called). After a location has been generated the program checks for any existing file at that location, and if one exists it looks at the last write time of both the file to copy and the file "in its way". If those are equal the file is skipped. If not, an MD5 checksum of both files is created and matched. If there is no match the file to be copied is given a new location to be copied to (for instance, if it was to be copied to "C:\test.jpg" it's copied to "C:\test(1).jpg" instead). The result of this operation is added to a queue of a struct type that contains two strings: the original file and the position to copy it to. Then that queue is iterated over until it is empty and the files are copied.
In other words there are 4 operations:
1. Scan directory for jpegs
2. Parse files for EXIF data and generate copy location
3. Check for file existence and if needed generate new path
4. Copy files
And so I want to rewrite this program to make it parallel and be able to perform several of the operations at the same time, and I was wondering what the best way to achieve that would be. I've come up with two different models, but neither one of them might be any good at all. The first one is to parallelize the 4 steps of the old program, so that when step one is executed it's done on several threads, and when all of step 1 is finished, step 2 begins. The other one (which I find more interesting, because I have no idea how to do it) is to create a sort of worker and consumer model, so when a thread is finished with step 1 another one takes over and performs step 2 on that object (or something like that). But as said, I don't know if either of these is a good solution. Also, I don't know much about parallel programming at all. I know how to make a thread and how to make it perform a function taking an object as its only parameter, and I've also used the BackgroundWorker class on one occasion, but I'm not that familiar with any of them.
Any input would be appreciated.
There are a few options:
Parallel LINQ: Running Queries On Multi-Core Processors
Task Parallel Library (TPL): Optimize Managed Code For Multi-Core Machines
If you are interested in basic threading primitives and concepts: Threading in C#
[But as #John Knoeller pointed out, the example you gave is likely to be sequential I/O bound]
This is the reference I use for C# thread: http://www.albahari.com/threading/
As a single PDF: http://www.albahari.com/threading/threading.pdf
For your second approach:
I've worked on some producer/consumer multithreaded apps where each task is some code that loops forever. An external "initializer" starts a separate thread for each task and initializes an EventWaitHandle for each task. For each task there is a global queue that can be used to produce/consume input.
In your case, your external program would add each directory to the queue for Task 1 and set the EventWaitHandle for Task 1. Task 1 would "wake up" from its EventWaitHandle, get the count of directories in its queue, and then while the count is greater than 0, get a directory from the queue, scan for all the .jpgs, add each .jpg location to a second queue, and set the EventWaitHandle for Task 2. Task 2 reads its input, processes it, and forwards it to a queue for Task 3...
It can be a bit of a pain getting all the locking to work right (I basically lock any access to the queue, even something as simple as getting its count). .NET 4.0 is supposed to have data structures that will automatically support a producer/consumer queue with no locks.
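As a rough illustration, the same chain can be expressed in .NET 4.0 with BlockingCollection, which handles the locking and signalling for you; GenerateCopyLocation is a hypothetical stand-in for the later steps:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var directories = new BlockingCollection<string>();
var jpegs = new BlockingCollection<string>();

// Task 1: scan each queued directory for .jpg files.
var task1 = Task.Factory.StartNew(() =>
{
    foreach (string dir in directories.GetConsumingEnumerable())
        foreach (string jpg in Directory.EnumerateFiles(dir, "*.jpg"))
            jpegs.Add(jpg);
    jpegs.CompleteAdding();
});

// Task 2: generate the copy location for each file (hypothetical helper).
var task2 = Task.Factory.StartNew(() =>
{
    foreach (string jpg in jpegs.GetConsumingEnumerable())
        Console.WriteLine(GenerateCopyLocation(jpg));
});

directories.Add(@"C:\Photos");   // seed the first stage, then signal completion
directories.CompleteAdding();
Task.WaitAll(task1, task2);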
Interesting problem.
I came up with two approaches. The first is based on PLINQ and the second is based on the Rx Framework.
The first one iterates through the files in parallel.
The second one asynchronously generates the files from the directory.
Here is how it looks in a much simplified version (the first method does require .NET 4.0, since it uses PLINQ):
string directory = "Mydirectory";
var jpegFiles = System.IO.Directory.EnumerateFiles(directory, "*.jpg");

// -- PLinq --------------------------------------------
jpegFiles
    .AsParallel()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    .ForAll(fileInfo =>
    {
        if (!File.Exists(fileInfo.NewLocation) ||
            File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
            File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
    });
// -----------------------------------------------------
// -----------------------------------------------------
//-- Rx Framework ---------------------------------------------
var resetEvent = new AutoResetEvent(false);

var doTheWork =
    jpegFiles.ToObservable()
        .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
        .Subscribe(fileInfo =>
        {
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
        }, () => resetEvent.Set());

resetEvent.WaitOne();
doTheWork.Dispose();
// -----------------------------------------------------
