C# read text file lines multi thread - c#

I want to write a fast multi-threaded program in C# that reads a file.
The file should be split into parts, with each part processed in a different thread. For example:
Line1
Line2
Line3
Line4
must be split into 4 lines, like this:
Line1 => thread 1
Line2 => thread 2
Line3 => thread 3
Line4 => thread 4
I used StreamReader.ReadLine(), but it can't read a specific line.
Comment: it's necessary to speed up the program, so I want to read the file in separate threads.

Unless you're using fixed-length lines, this isn't possible.
Why? Because in order to determine where the "lines" split, you need to find the newline characters... which means you need to read the file first.
Now, if you simply want to perform some extra "processing" after you read in each line - that is possible and relatively straightforward using a ThreadPool.
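A rough sketch of that shape (ProcessLine is just a placeholder for whatever per-line work you need):
using (var reader = new StreamReader("input.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string captured = line; // copy for the closure
        ThreadPool.QueueUserWorkItem(_ => ProcessLine(captured));
    }
}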

You should read the file in a single thread - but then spawn the processing of each line to a different thread, e.g. by adding it to a producer/consumer queue.
Even if you could seek to a specific line in a text file (which in general you can't) you really don't want the disk thrashing around - that'll only slow things down. The fastest way to get the data off the disk is to read it sequentially. By all means defer everything about handling the line beyond "decoding the binary data to text" to other threads, but you really don't want the IO to be in multiple threads.
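For example, a minimal sketch of that single-reader/many-workers shape using .NET 4's BlockingCollection (HandleLine is a placeholder for the real per-line work):
var queue = new BlockingCollection<string>(boundedCapacity: 1000);

// Consumers: process lines as they arrive.
var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (string line in queue.GetConsumingEnumerable())
        HandleLine(line); // placeholder for the real work
})).ToArray();

// Single producer: only one thread ever touches the disk.
foreach (string line in File.ReadLines("input.txt"))
    queue.Add(line);
queue.CompleteAdding();

Task.WaitAll(workers);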

AFAIK .NET doesn't support parallel stream reading. If you want to process every line you can use File.ReadAllLines, which returns an array of strings; then you can use PLINQ:
var result = File.ReadAllLines("path")
    .AsParallel()
    .Select(s => DoSthWithString(s))
    .ToList();

You're not going to be able to speed up the actual reading because you're going to have tremendous locking issues keeping everything straight.
Since a text file is unstructured, i.e. each line can have a different length, you have no choice but to read the lines one after the other.
Now, what you can do is process those lines on different threads, but the actual reading, keep that to one thread.
But, before you do that, are you sure you even have to do this? Is this a bottleneck? If not, fix the bottleneck first and see how far you get.

Your StreamReader wraps a Stream. Using the underlying Stream you can Seek to a particular byte position.
Like others have said, this probably isn't a good idea, but it can be done.
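For illustration, a minimal sketch (the byte offset is arbitrary, and you still have to resynchronize to the next newline yourself):
using (var stream = new FileStream("input.txt", FileMode.Open, FileAccess.Read))
{
    stream.Seek(1024, SeekOrigin.Begin);   // jump to an arbitrary byte offset
    using (var reader = new StreamReader(stream))
    {
        reader.ReadLine();                 // discard the (probably partial) first line
        string firstFullLine = reader.ReadLine();
    }
}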

I would split the file beforehand. Say the file is 1,000 lines: split it into 10 files of 100 lines each and have a thread process each file.


Is it possible to have any dataflow block type send multiple intermediate results as a result of a single input?

Is it possible to get TransformManyBlock to send intermediate results as they are created to the next step, instead of waiting for the entire IEnumerable<T> to be filled?
All the testing I've done shows that TransformManyBlock only sends a result to the next block when it is finished; the next block then reads those items one at a time.
It seems like basic functionality, but I can't find any examples of this anywhere.
The use case is processing chunks of a file as they are read. In my case a certain number of lines is needed before I can process anything, so a straight line-by-line stream won't work.
The kludge I've come up with is to create two pipelines:
a "processing" dataflow network that processes the chunks of data as they become available
a "producer" dataflow network that ends where the file is broken into chunks, which are then posted to the start of the "processing" network that actually transforms the data.
The "producer" network needs to be seeded with the starting point of the "processing" network.
This isn't a good long-term solution, since additional processing options will be needed and it isn't flexible.
Is it possible to have any dataflow block type send multiple intermediate results, as they are created, from a single input? Any pointers to working code?
You probably need to create your IEnumerables by using an iterator. That way an item will be propagated downstream after every yield return. The only problem is that yielding from lambda functions is not supported in C#, so you'll have to use a local function instead. Example:
var block = new TransformManyBlock<string, string>(filePath => ReadLines(filePath));

IEnumerable<string> ReadLines(string filePath)
{
    // File.ReadLines enumerates the file lazily, so each line can be
    // offered downstream as soon as it is yielded.
    foreach (var line in File.ReadLines(filePath))
    {
        yield return line; // Immediately offered to any linked block
    }
}
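For what it's worth, a rough usage sketch (the file path and the downstream action are placeholders): linking the block to an ActionBlock shows the lines arriving one at a time as they are yielded.
var printer = new ActionBlock<string>(line => Console.WriteLine(line));
block.LinkTo(printer, new DataflowLinkOptions { PropagateCompletion = true });

block.Post("myFile.txt");
block.Complete();
printer.Completion.Wait();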

c# how to grep through ~300mb log file quickly

I'm trying to read in a log file in C# that's huge - approx 300 MB of raw text data. I've been testing my program on smaller files, approx 1 MB, storing all log messages into a string[] array and searching with Contains.
However, that is too slow and takes up too much memory; I will never be able to process the 300 MB log file. I need a way to grep the file: quickly filter through it, find the useful data, and print the line of log information corresponding to the search.
The big question is scale. I think 300 MB will be my max, but I need my program to handle it. What functions, data structures, and searching can I use that will scale well, with speed and efficiency, to read a log file that big?
File.ReadLines is probably your best bet as it gives you an IEnumerable of lines of the text file and reads them lazily as you iterate over the IEnumerable. You can then use whatever method for searching the line you'd like to use (Regex, Contains, etc) and do something with it. My example below spawns a thread to search the line and output it to the console, but you can do just about anything. Of course, TEST, TEST, TEST on large files to see your performance mileage. I imagine if each individual thread spawned below takes too long, you can run into a thread limit.
IEnumerable<string> lines = File.ReadLines("myLargeFile.txt");
foreach (string line in lines) {
    string lineInt = line;
    (new Thread(() => {
        if (lineInt.Contains(keyword)) {
            Console.WriteLine(lineInt);
        }
    })).Start();
}
EDIT: Through my own testing, this is obviously faster:
foreach (string lineInt in File.ReadLines("myLargeFile.txt").Where(lineInt => lineInt.Contains(keyword))) {
    Console.WriteLine(lineInt);
}
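If you'd rather search with a Regex instead of Contains, the same lazy loop works. The pattern below is just an illustration; RegexOptions.Compiled helps when the same pattern runs against millions of lines:
var regex = new Regex(@"ERROR \d+", RegexOptions.Compiled);
foreach (string line in File.ReadLines("myLargeFile.txt"))
{
    if (regex.IsMatch(line))
        Console.WriteLine(line);
}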

c# multithreading file reading and page parsing

I have a file with more than 500,000 URLs. Now I want to read the file and parse every URL with my function, which returns a string message. For now everything is working fine, but the performance is not good, so I need to start the parsing in simultaneous threads (for example, 100 threads):
ParserEngine parseEngine = new ParserEngine(parseFormulas);
StreamReader reader = new StreamReader("urls.txt");
string line = string.Empty;
while ((line = reader.ReadLine()) != null)
{
    string result = parseEngine.Parse(line);
    Console.WriteLine(result);
}
reader.Close();
It would be good if I could stop all the threads by clicking a button and change the number of threads. Any help and tips?
Be sure to check out this article on PLINQ performance compared to other techniques for parsing a text file, line-by-line, using multi-threading.
Not only does it provide sample source code for doing something almost identical to what you want, but they also discovered a "gotcha" with PLINQ that can result in abnormally slow times. In a nutshell, if you try to use File.ReadAllLines() or StreamReader.ReadLine() you'll spoil the performance because PLINQ can't properly divide the file up that way. They solved the problem by reading all the lines into an indexed array, and THEN processing it with PLINQ.
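As far as I can tell, the workaround they describe ends up looking roughly like this (using the parseEngine from the question; AsOrdered is optional and only needed if result order matters):
string[] lines = File.ReadAllLines("urls.txt");   // read everything into an indexed array first
var results = lines
    .AsParallel()
    .AsOrdered()                                  // optional: keep results in input order
    .Select(line => parseEngine.Parse(line))
    .ToList();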
Honestly for the performance difference I would just try parallel foreach in .net 4.0 if that is an option.
using System.Threading.Tasks;

Parallel.ForEach(enumerableList, p =>
{
    parseEngine.Parse(p);
});

It's a decent start to running things in parallel and should minimize your thread troubleshooting headaches.
A producer/consumer setup would be good for this. One thread reading from the file and writing to a Queue, and the other threads can read from the queue.
You mentioned an example of 100 threads. If you had that many threads, you would want to read from the Queue in batches, since you'd probably have to lock the Queue before reading, as a Queue is only thread-safe for a single reader + writer.
I think there is a new ConcurrentQueue generic in 4.0, but I can't remember for sure.
You really only want one reader to the file.
You could use Parallel.ForEach() to schedule a thread for each item in the list. That would spread the threads out among all available processors, assuming that parseEngine takes some time to run. If parseEngine runs pretty quickly (defined as less than 250ms), increase the number of "on-demand" threads by calling ThreadPool.SetMinThreads(), which will result in more threads executing at once.
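To cover the cancel button and the configurable thread count from the question, a rough sketch using ParallelOptions (the cancellation source would be signalled from your button's click handler):
var cts = new CancellationTokenSource();   // call cts.Cancel() from the button handler
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = 100,          // the "number of threads" knob
    CancellationToken = cts.Token
};

try
{
    Parallel.ForEach(File.ReadLines("urls.txt"), options, line =>
    {
        Console.WriteLine(parseEngine.Parse(line));
    });
}
catch (OperationCanceledException)
{
    // the user stopped the run
}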

Should I build a string first and then write to file?

A program I am working on right now has to generate a file. Is it better for me to generate the file's contents as a string first and then write that string to the file, or should I just directly add the contents to the file?
Are there any advantages of one over the other?
The file will be about 0.5 - 1MB in size.
If you write to a file as-you-go, you'll have the benefit of not keeping everything in memory, if it's a big enough file and you constantly flush the stream.
However, you'll be more likely to run into problems with a partially-written file, since you're doing your IO over a period of time instead of in a single shot.
Personally, I'd build it up using a StringBuilder, and then write it all to disk in a single shot.
I think it's a better idea, in general, to create a StreamWriter and just write to it. Why keep things in memory when you don't have to? And it's a whole lot easier. For example:
using (var writer = new StreamWriter("filename"))
{
    writer.WriteLine(header);
    // write all your data with Write and WriteLine,
    // taking advantage of composite formatting
}
If you want to build multiple lines with StringBuilder you have to write something like:
var sb = new StringBuilder();
sb.AppendLine(string.Format("{0:N0} blocks read", blocksRead));
// etc., etc.
// and finally write it to file
File.WriteAllText("filename", sb.ToString());
There are other options, of course. You could build the lines into a List<string> and then use File.WriteAllLines. Or you could write to a StringWriter and then write its contents to the file. But all of those approaches have you handling the data multiple times. Just open the StreamWriter and write.
The primary reasons I think it's a better idea in general to go directly to output:
You don't have to refactor your code when it turns out that your output data is too big to fit in memory.
The planned destination is the file anyway, so why fool with formatting it in memory before writing to the file?
The API for writing multiple lines to a text file is, in my opinion, cleaner than the API for adding lines to a StringBuilder.
I think it is better to use a string or StringBuilder to store your data; then you can write it to the file using the File.Write* methods.

Parallel programming in C#

I'm interested in learning about parallel programming in C#.NET (not everything there is to know, but the basics and maybe some good practices), so I've decided to reprogram an old program of mine called ImageSyncer. ImageSyncer is a really simple program; all it does is scan through a folder, find all files ending with .jpg, and then calculate the new position of the files based on the date they were taken (parsing of EXIF data, or whatever it's called).
After a location has been generated, the program checks for any existing file at that location, and if one exists it looks at the last write time of both the file to copy and the file "in its way". If those are equal the file is skipped. If not, an MD5 checksum of both files is created and compared. If there is no match, the file to be copied is given a new location to be copied to (for instance, if it was to be copied to "C:\test.jpg" it's copied to "C:\test(1).jpg" instead). The result of this operation is put into a queue of a struct type that contains two strings: the original file and the position to copy it to. Then that queue is iterated over until it is empty and the files are copied.
In other words there are 4 operations:
1. Scan directory for jpegs
2. Parse files for xif and generate copy-location
3. Check for file existence and if needed generate new path
4. Copy files
And so I want to rewrite this program to make it parallel and able to perform several of the operations at the same time, and I was wondering what the best way to achieve that would be. I've come up with two different models, but neither of them might be any good at all. The first is to parallelize the 4 steps of the old program, so that when step 1 is executed it's done on several threads, and when all of step 1 is finished, step 2 begins. The other (which I find more interesting because I have no idea how to do it) is to create a sort of worker/consumer model, so that when a thread is finished with step 1, another one takes over and performs step 2 on that object (or something like that). But as said, I don't know if either of these is a good solution. Also, I don't know much about parallel programming at all. I know how to create a thread and how to make it run a function taking an object as its only parameter, and I've also used the BackgroundWorker class on one occasion, but I'm not that familiar with any of them.
Any input would be appreciated.
There are a few options:
Parallel LINQ: Running Queries On Multi-Core Processors
Task Parallel Library (TPL): Optimize Managed Code For Multi-Core Machines
If you are interested in basic threading primitives and concepts: Threading in C#
[But as #John Knoeller pointed out, the example you gave is likely to be sequential I/O bound]
This is the reference I use for C# thread: http://www.albahari.com/threading/
As a single PDF: http://www.albahari.com/threading/threading.pdf
For your second approach:
I've worked on some producer/consumer multithreaded apps where each task is some code that loops forever. An external "initializer" starts a separate thread for each task and initializes an EventWaitHandle for each task. Each task also has a global queue that can be used to produce/consume input.
In your case, your external program would add each directory to the queue for Task 1 and set the EventWaitHandle for Task 1. Task 1 would "wake up" from its EventWaitHandle, get the count of directories in its queue, and then, while the count is greater than 0, get a directory from the queue, scan it for all the .jpgs, add each .jpg location to a second queue, and set the EventWaitHandle for Task 2. Task 2 reads its input, processes it, and forwards it to a queue for Task 3...
It can be a bit of a pain getting all the locking to work right (I basically lock any access to the queue, even something as simple as getting its count). .NET 4.0 is supposed to have data structures that will automatically support a producer/consumer queue with no locks.
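The .NET 4.0 structure alluded to is, I believe, BlockingCollection<T>. A rough sketch of chaining the first two stages that way, without explicit locks or EventWaitHandles (GenerateCopyLocation is just a stand-in for your step 2):
var directories = new BlockingCollection<string>();
var jpegFiles = new BlockingCollection<string>();

// Stage 1: scan each directory and feed the next queue.
var stage1 = Task.Factory.StartNew(() =>
{
    foreach (var dir in directories.GetConsumingEnumerable())
        foreach (var jpg in Directory.EnumerateFiles(dir, "*.jpg"))
            jpegFiles.Add(jpg);
    jpegFiles.CompleteAdding();
});

// Stage 2: compute copy locations as file names arrive.
var stage2 = Task.Factory.StartNew(() =>
{
    foreach (var file in jpegFiles.GetConsumingEnumerable())
        Console.WriteLine(GenerateCopyLocation(file)); // placeholder for steps 2-4
});

directories.Add(@"C:\Photos");
directories.CompleteAdding();
Task.WaitAll(stage1, stage2);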
Interesting problem.
I came up with two approaches. The first is based on PLinq and the second is based on the Rx Framework.
The first one iterates through the files in parallel.
The second one generates asynchronously the files from the directory.
Here is what it looks like in a much simplified version (the first method requires .NET 4.0 since it uses PLinq):
string directory = "Mydirectory";
var jpegFiles = System.IO.Directory.EnumerateFiles(directory, "*.jpg");

// -- PLinq --------------------------------------------
jpegFiles
    .AsParallel()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    .ForAll(fileInfo =>
    {
        if (!File.Exists(fileInfo.NewLocation) ||
            File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
            File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
    });
// -----------------------------------------------------
//-- Rx Framework ---------------------------------------------
var resetEvent = new AutoResetEvent(false);
var doTheWork =
    jpegFiles.ToObservable()
        .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
        .Subscribe(fileInfo =>
        {
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
        }, () => resetEvent.Set());
resetEvent.WaitOne();
doTheWork.Dispose();
// -----------------------------------------------------
