Read and process files in parallel C#

I have very big files that I have to read and process. Can this be done in parallel using threading?
Here is a bit of code that I've written. But it doesn't seem to achieve a shorter execution time than reading and processing the files one after the other.
String[] files = openFileDialog1.FileNames;

Parallel.ForEach(files, f =>
{
    readTraceFile(f);
});
private void readTraceFile(String file)
{
    StreamReader reader = new StreamReader(file);
    String line;
    while ((line = reader.ReadLine()) != null)
    {
        String pattern = "\\s{4,}";
        foreach (String trace in Regex.Split(line, pattern))
        {
            if (trace != String.Empty)
            {
                String[] details = Regex.Split(trace, "\\s+");
                Instruction instruction = new Instruction(details[0],
                                                          int.Parse(details[1]),
                                                          int.Parse(details[2]));
                Console.WriteLine("computing...");
                instructions.Add(instruction);
            }
        }
    }
}

It looks like your application's performance is mostly limited by IO. However, you still have a bit of CPU-bound work in your code. These two bits of work are interdependent: your CPU-bound work cannot start until the IO has done its job, and the IO does not move on to the next work item until your CPU has finished with the previous one. They're both holding each other up. Therefore, it is possible (explained at the very bottom) that you will see an improvement in throughput if you perform your IO- and CPU-bound work in parallel, like so:
void ReadAndProcessFiles(string[] filePaths)
{
    // Our thread-safe collection used for the handover.
    var lines = new BlockingCollection<string>();

    // Build the pipeline.
    var stage1 = Task.Run(() =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Hand over to stage 2 and continue reading.
                        lines.Add(line);
                    }
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        // Process lines on a ThreadPool thread
        // as soon as they become available.
        foreach (var line in lines.GetConsumingEnumerable())
        {
            String pattern = "\\s{4,}";
            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");
                    Instruction instruction = new Instruction(details[0],
                                                              int.Parse(details[1]),
                                                              int.Parse(details[2]));
                    Console.WriteLine("computing...");
                    instructions.Add(instruction);
                }
            }
        }
    });

    // Block until both tasks have completed.
    // This makes this method prone to deadlocking.
    // Consider using 'await Task.WhenAll' instead.
    Task.WaitAll(stage1, stage2);
}
I highly doubt that it's your CPU work holding things up, but if it happens to be the case, you can also parallelise stage 2 like so:
var stage2 = Task.Run(() =>
{
    var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
    Parallel.ForEach(lines.GetConsumingEnumerable(), parallelOptions, line =>
    {
        String pattern = "\\s{4,}";
        foreach (String trace in Regex.Split(line, pattern))
        {
            if (trace != String.Empty)
            {
                String[] details = Regex.Split(trace, "\\s+");
                Instruction instruction = new Instruction(details[0],
                                                          int.Parse(details[1]),
                                                          int.Parse(details[2]));
                Console.WriteLine("computing...");
                // Note: 'instructions' must now be a thread-safe collection
                // (e.g. ConcurrentBag<Instruction>), since multiple threads add to it.
                instructions.Add(instruction);
            }
        }
    });
});
Mind you, if your CPU work component is negligible in comparison to the IO component, you won't see much speed-up. The more even the workload is, the better the pipeline is going to perform in comparison with sequential processing.
Since we're talking about performance, note that I am not particularly thrilled about the number of blocking calls in the above code. If I were doing this in my own project, I would have gone the async/await route. I chose not to do so here because I wanted to keep things easy to understand and easy to integrate.
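For the curious, here is a minimal sketch of what that async route might look like, assuming .NET Core 3.0+ with System.Threading.Channels available; ProcessLine is a hypothetical stand-in for the regex/parse work in stage 2 above:

async Task ReadAndProcessFilesAsync(string[] filePaths)
{
    // The channel plays the role BlockingCollection played above,
    // but can be produced and consumed without blocking threads.
    var channel = Channel.CreateUnbounded<string>();

    var stage1 = Task.Run(async () =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;
                    while ((line = await reader.ReadLineAsync()) != null)
                    {
                        await channel.Writer.WriteAsync(line);
                    }
                }
            }
        }
        finally
        {
            channel.Writer.Complete(); // analogous to CompleteAdding
        }
    });

    var stage2 = Task.Run(async () =>
    {
        await foreach (var line in channel.Reader.ReadAllAsync())
        {
            ProcessLine(line); // hypothetical: the same CPU-bound work as before
        }
    });

    await Task.WhenAll(stage1, stage2); // awaiting instead of blocking: no deadlock risk
}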

From the look of what you are trying to do, you are almost certainly I/O bound. Attempting parallel processing in this case will not help and may in fact slow down processing due to additional seek operations on the disk drives (unless you can have the data split over multiple spindles).

Try processing the lines in parallel instead. For example:
var q = from file in files
        from line in File.ReadLines(file).AsParallel() // for smaller files, File.ReadAllLines(file).AsParallel() might be faster
        from trace in line.Split(new[] { "    " }, StringSplitOptions.RemoveEmptyEntries) // split by four spaces; no need for the trace != "" check
        let details = trace.Split(null as char[], StringSplitOptions.RemoveEmptyEntries) // like Regex.Split(trace, "\\s+") but removes empty strings too
        select new Instruction(details[0], int.Parse(details[1]), int.Parse(details[2]));

List<Instruction> instructions = q.ToList(); // all of the file reads and work happen here, at .ToList
Random access to a non-SSD hard drive (when you try to read/write different files at the same time, or a fragmented file) is usually much slower than sequential access (for example, reading a single defragmented file), so I expect processing a single file in parallel to be faster with defragmented files.
Also, sharing resources across threads (for example, Console.Write or adding to a thread-safe blocking collection) can slow down or even block/deadlock the execution, because some threads will have to wait for others to finish accessing that resource.
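One way to reduce that kind of contention, sketched here under the assumption that the per-line parsing from the question is the per-item work, is the Parallel.ForEach overload with per-thread local state: each thread accumulates into its own private list and touches the shared list only once, in localFinally:

var instructions = new List<Instruction>();
var mergeLock = new object();

Parallel.ForEach(
    File.ReadLines(file),
    () => new List<Instruction>(),            // localInit: one private list per thread
    (line, loopState, local) =>
    {
        foreach (var trace in line.Split(new[] { "    " }, StringSplitOptions.RemoveEmptyEntries))
        {
            var details = Regex.Split(trace, "\\s+");
            local.Add(new Instruction(details[0], int.Parse(details[1]), int.Parse(details[2])));
        }
        return local;                         // no shared state touched per item
    },
    local =>                                  // localFinally: runs once per thread
    {
        lock (mergeLock) instructions.AddRange(local);
    });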

var entries = new ConcurrentBag<object>();
var files = Directory.GetFiles(path, "*.txt", SearchOption.AllDirectories);
int fileCounter = 0;

Parallel.ForEach(files.ToList(), file =>
{
    var lines = File.ReadAllLines(file, Encoding.Default);
    // ConcurrentBag is safe to add to from multiple threads.
    entries.Add(new { lineCount = lines.Length });
    // Interlocked keeps the shared counter accurate across threads.
    Interlocked.Increment(ref fileCounter);
});

Related

Task.WhenAll recursion inconsistent results compared to for loop

I have a method where I need recursion to get a hierarchy of files and folders from an API (Graph). When I do my recursion inside of a for loop it works as expected and returns a hierarchy with 665 files. This takes about a minute, though, because it only fetches one folder at a time, whereas doing it with Task.WhenAll only takes 10 seconds.
When using Task.WhenAll I get inconsistent results, though: it will return the hierarchy with anywhere from 661 to 665 files depending on the run, with the exact same code. I'm using the variable totalFileCount as an indication of how many files it has found.
Obviously I'm doing something wrong, but I can't quite figure out what. Any help is greatly appreciated!
For loop
for (int i = 0; i < folders.Count; i++)
{
    var folder = folders[i];
    await GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
}
Task.WhenAll
var tasks = new List<Task>();
for (int i = 0; i < folders.Count; i++)
{
    var folder = folders[i];
    var task = GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
    tasks.Add(task);
}
await Task.WhenAll(tasks);
Full method
public async Task<GraphFolderItem> GetSharePointHierarchy(DriveItem currentDrive, GraphFolderItem parentFolderItem, string outPath = "")
{
    IEnumerable<DriveItem> children = await graphHandler.GetFolderChildren(sourceSharepointId, currentDrive.Id);
    var folders = new List<DriveItem>();
    var files = new List<DriveItem>();
    var graphFolderItems = new List<GraphFolderItem>();

    foreach (var item in children)
    {
        if (item.Folder != null)
        {
            System.IO.Directory.CreateDirectory(outPath + item.Name);
            //Console.WriteLine(outPath + item.Name);
            folders.Add(item);
        }
        else
        {
            totalFileCount++;
            files.Add(item);
        }
    }

    var folderItem = new GraphFolderItem
    {
        SourceFolder = currentDrive,
        ItemChildren = files,
        FolderChildren = graphFolderItems,
        DownloadPath = outPath
    };
    parentFolderItem.FolderChildren.Add(folderItem);

    for (int i = 0; i < folders.Count; i++)
    {
        var folder = folders[i];
        await GetSharePointHierarchy(folder, folderItem, $"{outPath}{folder.Name}\\");
    }
    return parentFolderItem;
}
It is a race condition problem. In parallel execution you should not mutate an ordinary variable; always use a thread-safety mechanism that fits your requirement, such as thread-safe datatypes/collections, lock, Monitor, or Interlocked.
In this case Interlocked.Increment is a good approach: wherever totalFileCount is incremented, replace it with
Interlocked.Increment(ref totalFileCount);
For a more detailed understanding, see the documentation on thread safety.
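To see the lost updates in isolation, here is a small repro you could run inside an async method (the iteration count and variable names are illustrative):

int unsafeCount = 0, safeCount = 0;

await Task.WhenAll(Enumerable.Range(0, 1000).Select(_ => Task.Run(() =>
{
    unsafeCount++;                        // read-modify-write: two threads can read the
                                          // same value, so one of the updates is lost
    Interlocked.Increment(ref safeCount); // atomic: never loses an update
})));

Console.WriteLine($"{unsafeCount} vs {safeCount}"); // unsafeCount often ends up below 1000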
The problem is that with Task.WhenAll the code flow runs in parallel, whereas awaiting each call of the async function in turn does not. And this is exactly your problem: the code inside the async function accesses a shared object, totalFileCount, so multiple threads access it at the same time.
To fix it while still executing the code in parallel, surround the access to totalFileCount with a lock statement, which limits the number of concurrent executions of a block of code:
lock (lockRefObject)
{
    totalFileCount++;
}
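For completeness, lockRefObject is not shown in the question; a declaration like this is assumed:

// Assumed field: a private object that exists only to be locked on,
// so the lock cannot accidentally be shared with unrelated code.
private static readonly object lockRefObject = new object();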

Dequeue a collection to write to disk until no more items left in collection

So, I have a list of file shares.
I then need to obtain ALL folders in these file shares. This is all the "easy" stuff I have done.
Ultimately after some logic, I am adding an object into a collection.
All the while this is happening in the background using Async/Await and Tasks, I want to be able to have another thread/task spin up so it can keep going through the collection and write data to disk.
Now, for each folder, I obtain security information about that folder. There will be at LEAST 1 item. But for each item, I add this information into a collection (for each folder).
I want to write to disk in a background until there are no more folders to iterate through and the job is complete.
I was thinking of using a BlockingCollection; however, this code smells and ultimately never closes the file because of the while (true) loop.
private static BlockingCollection<DirectorySecurityInformation> AllSecurityItemsToWrite = new BlockingCollection<DirectorySecurityInformation>();

if (sharesResults.Count > 0)
{
    WriteCSVHeader();

    // setup a background task which will dequeue items to write.
    var csvBGTask = Task.Run(async () =>
    {
        using (var sw = new StreamWriter(FileName, true))
        {
            sw.AutoFlush = true;
            while (true)
            {
                var dsi = AllSecurityItemsToWrite.Take();
                await sw.WriteLineAsync("... blah blah blah...");
                await sw.FlushAsync();
            }
        }
    });

    allTasks.Add(csvBGTask);
}
foreach (var currentShare in AllShares)
{
    var dirs = Directory.EnumerateDirectories(currentShare.FullName, "*", SearchOption.AllDirectories);
    foreach (var currentDir in dirs)
    {
        // Spin up a task in the BG to do some security analysis and add to the AllSecurityItemsToWrite collection
    }
}
This is at its simplest but core example.
Any ideas? I just want to keep adding in the background task and have another task dequeue and write to disk until there are no more shares to go through (sharesResults).
I recommend using a Channel.
Channel<DirectorySecurityInformation> ch =
    Channel.CreateUnbounded<DirectorySecurityInformation>();
Write
var w = ch.Writer;
foreach (var dsi in DSIs)
    w.TryWrite(dsi);
w.TryComplete();
Read
// async Task rather than async void, so the caller can await completion
// and know when the file has been closed.
public async Task ReadTask()
{
    var r = ch.Reader;
    using (var sw = new StreamWriter(filename, true))
    {
        await foreach (var dsi in r.ReadAllAsync())
            sw.WriteLine(dsi);
    }
}
while (true)
{
    var dsi = AllSecurityItemsToWrite.Take();
    //...
}
Instead of using the Take method, it's generally more convenient to consume a BlockingCollection<T> with the GetConsumingEnumerable method:
foreach (var dsi in AllSecurityItemsToWrite.GetConsumingEnumerable())
{
    //...
}
This way the loop will stop automatically when the CompleteAdding method has been called and the collection is empty.
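On the producer side that would look roughly like this (a sketch only; the security-analysis step is elided, as in the question):

try
{
    foreach (var currentShare in AllShares)
    {
        foreach (var currentDir in Directory.EnumerateDirectories(currentShare.FullName, "*", SearchOption.AllDirectories))
        {
            // ... security analysis for currentDir, then:
            // AllSecurityItemsToWrite.Add(dsi);
        }
    }
}
finally
{
    // Unblocks the consumer loop once the remaining items are drained,
    // which lets the StreamWriter's using block close the file.
    AllSecurityItemsToWrite.CompleteAdding();
}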
But I agree with shingo that the BlockingCollection<T> is not the correct tool in this case, because your workers are running on an asynchronous context. A Channel<T> should be preferable, because it can be consumed without blocking a thread.
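An end-to-end sketch of the Channel<T> approach, under the same assumptions about FileName, AllShares, and the elided analysis step:

var channel = Channel.CreateUnbounded<DirectorySecurityInformation>();

// Single consumer: drains the channel to the CSV without blocking a thread.
var writerTask = Task.Run(async () =>
{
    using (var sw = new StreamWriter(FileName, append: true))
    {
        await foreach (var dsi in channel.Reader.ReadAllAsync())
            await sw.WriteLineAsync(dsi.ToString());
    }
});

foreach (var currentShare in AllShares)
{
    foreach (var currentDir in Directory.EnumerateDirectories(currentShare.FullName, "*", SearchOption.AllDirectories))
    {
        // ... security analysis for currentDir, then:
        // channel.Writer.TryWrite(dsi);
    }
}

channel.Writer.Complete(); // no more items: the reader loop ends
await writerTask;          // the file is flushed and closed here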

C# Scan tree recursively with multiple threads

I'm scanning some directory for items. I've just read the Multithreaded Directory Looping in C# question, but I still want to make it multithreaded. Even though everyone says the drive will be the bottleneck, I have some points:
The drives may mostly be "single threaded", but how do you know what they are going to bring up in the future?
How do you know that the different sub-paths you are scanning are on the same physical drive?
I'm using an abstraction layer (even two) over System.IO so that I can later reuse the code in different scenarios.
So, my first idea was to use Task, and the first dummy implementation was this:
public async Task Scan(bool recursive = false) {
    var t = new Task(() => {
        foreach (var p in path.scan) Add(p);
        if (!recursive) return;
        var tks = new Task[subs.Count]; var i = 0;
        foreach (var s in subs) tks[i++] = s.Scan(true);
        Task.WaitAll(tks);
    }); t.Start();
    await t;
}
I don't like the idea of creating a Task for each item, and generally this doesn't seem ideal, but this was just for a test, as Tasks are advertised to automatically manage the threads...
This method works, but it's very slow. It takes over 5 s to complete, while the single-threaded version below takes around 0.5 s to complete the whole program on the same data set:
public void Scan2(bool recursive = false) {
    foreach (var p in path.scan) Add(p);
    if (!recursive) return;
    foreach (var s in subs) s.Scan2(true);
}
I wonder what really goes wrong with the first method. The machine is not under load, CPU usage is insignificant, the drive is fine... I've tried profiling it with NProfiler, but it doesn't tell me much besides that the program sits on Task.WaitAll(tks) all the time.
I also wrote a thread-locked counting mechanism that is invoked during the addition of each item. Maybe the problem is with it?
#region SubCounting
public Dictionary<Type, int> counters = new Dictionary<Type, int>();
private object cLock = new object();
private int _sc = 0;
public int subCount => _sc;

private void inCounter(Type t) {
    lock (cLock) {
        if (!counters.ContainsKey(t)) counters.Add(t, 1);
        counters[t]++;
        _sc++;
    }
    if (parent) parent.inCounter(t);
}
#endregion
But even if threads are waiting here, wouldn't the execution time be similar to the single-threaded version, as opposed to 10x slower?
I'm not sure how to approach this. If I don't want to use tasks, do I need to manage threads manually, or is there already some library that would fit nicely for the job?
I think you almost got it. Task.WaitAll(tks) is the problem: it blocks one thread, as it is a synchronous operation. You soon run out of threads, and all of them are just waiting for tasks that have no threads to run on. You can solve this with async: replace the blocking wait with await Task.WhenAll(...), which frees the thread while waiting. With some CPU workload, the multithreaded version is significantly faster; when purely IO-bound, it is roughly equal.
ConcurrentBag<string> result = new ConcurrentBag<string>();
List<string> result2 = new List<string>();

public async Task Scan(string path)
{
    await Task.Run(async () =>
    {
        var subs = Directory.GetDirectories(path);
        await Task.WhenAll(subs.Select(s => Scan(s)));
        result.Add(Enumerable.Range(0, 1000000).Sum(i => path[i % path.Length]).ToString());
    });
}

public void Scan2(string path)
{
    result2.Add(Enumerable.Range(0, 1000000).Sum(i => path[i % path.Length]).ToString());
    var subs = Directory.GetDirectories(path);
    foreach (var s in subs) Scan2(s);
}
private async void button4_Click(object sender, EventArgs e)
{
    string dir = @"d:\tmp";
    System.Diagnostics.Stopwatch st = new System.Diagnostics.Stopwatch();
    st.Start();
    await Scan(dir);
    st.Stop();
    MessageBox.Show(st.ElapsedMilliseconds.ToString());

    st = new System.Diagnostics.Stopwatch();
    st.Start();
    Scan2(dir);
    st.Stop();
    MessageBox.Show(st.ElapsedMilliseconds.ToString());

    MessageBox.Show(result.OrderBy(x => x).SequenceEqual(result2.OrderBy(x => x)) ? "OK" : "ERROR");
}

Using C# 5.0 async to read a file

I'm just starting out with C#'s new async features. I've read plenty of how-to's now on parallel downloads etc. but nothing on reading/processing a text file.
I had an old script I use to filter a log file and figured I'd have a go at upgrading it. However, I'm unsure if my usage of the new async/await syntax is correct.
In my head I see this reading the file line by line and passing it on for processing in a different thread, so it can continue without waiting for a result.
Am I thinking about it correctly, or what is the best way to implement this?
static async Task<string[]> FilterLogFile(string fileLocation)
{
    string line;
    List<string> matches = new List<string>();
    using (TextReader file = File.OpenText(fileLocation))
    {
        while ((line = await file.ReadLineAsync()) != null)
        {
            CheckForMatch(line, matches);
        }
    }
    return matches.ToArray();
}
The full script: http://share.linqpad.net/29kgbe.linq
In my head I see this reading the file line by line and passing it on for processing in a different thread, so it can continue without waiting for a result.
But that's not what your code does. Instead, you will (asynchronously) return an array when all reading is done. If you actually want to asynchronously return the matches one by one, you would need some sort of asynchronous collection. You could use a block from TPL Dataflow for that. For example:
ISourceBlock<string> FilterLogFile(string fileLocation)
{
    var block = new BufferBlock<string>();

    Task.Run(async () =>
    {
        string line;
        using (TextReader file = File.OpenText(fileLocation))
        {
            while ((line = await file.ReadLineAsync()) != null)
            {
                var match = GetMatch(line);
                if (match != null)
                    block.Post(match);
            }
        }
        block.Complete();
    });

    return block;
}
(You would need to add error handling, probably by faulting the returned block.)
You would then link the returned block to another block that will process the results. Or you could read them directly from the block (by using ReceiveAsync()).
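For the direct-read option, a consumer loop might look like this (a sketch; block is the ISourceBlock<string> returned by FilterLogFile above):

var block = FilterLogFile(fileLocation);

// Drain the block as matches arrive; OutputAvailableAsync returns false
// once the block has completed and no items remain.
while (await block.OutputAvailableAsync())
{
    string match = await block.ReceiveAsync();
    Console.WriteLine(match); // process each match as it arrives
}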
But looking at the full code, I'm not sure this approach would be that useful to you. Because of the way you process the results (grouping and then ordering by count in each group), you can't do much with them until you have all of them.

Parallel.ForEach behaving like a regular for each towards the end of the iteration

I am having this issue when I run something like this:
Parallel.ForEach(dataTable.AsEnumerable(), row =>
{
    //do processing
});
Assuming that there are 500+ records, say 870: once Parallel.ForEach completes 850 of them, it seems to run sequentially, i.e. only one operation at a time. It completed 850 operations very fast, but close to the end of the iteration it becomes very slow and seems to perform like a regular foreach. I even tried it with 2000 records.
Is something wrong in my code? Please give suggestions.
Below is the code I am using. Sorry, I just posted the wrong example at first; this is the correct code:
Task newTask = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(dtResult.AsEnumerable(), dr =>
    {
        string extractQuery = "";
        string downLoadFileFullName = "";
        lock (foreachObject)
        {
            string fileName = extractorConfig.EncodeFileName(dr);
            extractQuery = extractorConfig.GetExtractQuery(dr);
            if (string.IsNullOrEmpty(extractQuery)) throw new Exception("Extract Query not found. Please check the configuration");
            string newDownLoadPath = CommonUtil.GetFormalizedDataPath(sDownLoadPath, uKey.CobDate);
            //create folder if it doesn't exist
            if (!Directory.Exists(newDownLoadPath)) Directory.CreateDirectory(newDownLoadPath);
            downLoadFileFullName = Path.Combine(newDownLoadPath, fileName);
        }
        Interlocked.Increment(ref index);
        ExtractorClass util = new ExtractorClass(SourceDbConnStr);
        util.LoadToFile(extractQuery, downLoadFileFullName);
        Interlocked.Increment(ref uiTimerIndex);
    });
});
My guess:
This looks to have a high degree of potential IO from:
Database+Disk
Network communication to DB and back
Writing results to disk
Therefore a lot of time is going to be spent waiting for IO. My guess is that the waiting is only getting worse as more threads are being added to the mix and IO is being further stressed. For instance a disk only has one set of heads, so you cannot write to it concurrently. If you have a large number of threads trying to write concurrently, performance degrades.
Try limiting the maximum number of threads you are using:
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dtResult.AsEnumerable(), options, dr =>
{
    //Do stuff
});
Update
After your code edit, I would suggest the following, which has a couple of changes:
Reduce the maximum number of threads - this can be experimented with.
Only perform the directory check and creation once.
Code:
private static bool isDirectoryCreated;
//...
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dtResult.AsEnumerable(), options, dr =>
{
    string fileName, extractQuery, newDownLoadPath;
    lock (foreachObject)
    {
        fileName = extractorConfig.EncodeFileName(dr);
        extractQuery = extractorConfig.GetExtractQuery(dr);
        if (string.IsNullOrEmpty(extractQuery))
            throw new Exception("Extract Query not found. Please check the configuration");
        newDownLoadPath = CommonUtil.GetFormalizedDataPath(sDownLoadPath, uKey.CobDate);
        if (!isDirectoryCreated)
        {
            if (!Directory.Exists(newDownLoadPath))
                Directory.CreateDirectory(newDownLoadPath);
            isDirectoryCreated = true;
        }
    }
    string downLoadFileFullName = Path.Combine(newDownLoadPath, fileName);
    Interlocked.Increment(ref index);
    ExtractorClass util = new ExtractorClass(SourceDbConnStr);
    util.LoadToFile(extractQuery, downLoadFileFullName);
    Interlocked.Increment(ref uiTimerIndex);
});
It's hard to give details without the relevant code, but in general this is the expected behaviour. .NET tries to schedule the tasks such that every processor is evenly busy.
But this can only ever be approximated, since not all of the tasks take the same amount of time. At the end, some processors will be done working and some won't, and re-distributing the work is costly and not always beneficial.
I don't know the details of the load balancing used by PLINQ, but the bottom line is that this behaviour can never be fully prevented.
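One mitigation worth trying, offered as a sketch rather than a guaranteed fix: Parallel.ForEach over a plain IEnumerable uses chunk partitioning, so near the end one thread can be left working through a large final chunk while the others sit idle. Handing out one item at a time smooths that tail at the cost of some synchronization overhead:

// NoBuffering makes the partitioner hand each worker one row at a time,
// trading per-item overhead for better load balancing near the end.
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
var partitioner = Partitioner.Create(
    dtResult.AsEnumerable(),
    EnumerablePartitionerOptions.NoBuffering);

Parallel.ForEach(partitioner, options, dr =>
{
    // ... same per-row work as above ...
});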
