Reading a lot of files "at the same time"

Reading a lot of files "at the same time" - c#

I'm using FileSystemWatcher in order to catch every created, changed, deleted and renamed change over whichever file in a folder.
Over this changes I need to perform a simple checksum of the contents of these files. Simply, I'm opening a filestream and pass it to MD5 class:
private byte[] calculateChecksum(string frl)
{
using (FileStream stream = File.Open(frl, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
return this.md5.ComputeHash(stream);
}
}
The problem is according the amount of files I need to handle. For example, imagine I have 200 files created along the time in a folder, and then I copy all of them and paste them on the same folder. This action is going to cause 200 event and 200 calculateChecksum() performings.
How could I solve this kind of problems?

In FileSystemWatcher handler put tasks to queue that will processed by some worker. Worker can process checksum calc tasks with targeted speed or/and frequency. Probably one worker will be better because many readers can slow down hdd with many read seeks.
Try read about BlockingCollection:
https://msdn.microsoft.com/ru-ru/library/dd997371(v=vs.110).aspx
and Producer-Consumer Dataflow Pattern
https://msdn.microsoft.com/ru-ru/library/hh228601(v=vs.110).aspx
var workerCount = 2;
BlockingCollection<String>[] filesQueues= new BlockingCollection<String>[workerCount];
for(int i = 0; i < workerCount; i++)
{
filesQueues[i] = new BlockingCollection<String>(500);
// Worker
Task.Run(() =>
{
while (!filesQueues[i].IsCompleted)
{
string url;
try
{
url= filesQueues[i].Take();
}
catch (InvalidOperationException) { }
if (!string.IsNullOrWhiteSpace(url))
{
calculateChecksum(url);
}
}
}
}
// inside of FileSystemWatcher handler
var queueIndex = hash(filename) % workersCount
// Warning!!
// Blocks if numbers.Count == dataItems.BoundedCapacity
filesQueues[queueIndex].Add(fileName);
filesQueues[queueIndex].CompleteAdding();
Also you can make multiple consumers, just call Take or TryTake concurrently - each item will only be consumed by a single consumer. But take into account in that case one file can be processed by many workers, and multiple hdd readers can slow down hdd.
UPD in case of multiple workers, it would be better to make multiple BlockingCollections, and push files in queue with index:

I've scketched a cosumer-producer pattern to solve that, and I've tried to use a thread pool in order to smooth the big amount of work, sharing a BlockingCollection
BlockingCollection & ThreadPool:
private BlockingCollection<Index.ResourceIndexDocument> documents;
this.pool = new SmartThreadPool(SmartThreadPool.DefaultIdleTimeout, 4);
this.documents = new BlockingCollection<string>();
As you cann see, I've created a I treadPool setting concurrency to 4. So, there is going to work only 4 thread at the same time regasdless of whether there is x > 4 work's units to handle in the pool.
Producer:
public void warn(string channel, string frl)
{
this.pool.QueueWorkItem<string, string>(
(file) => this.files.Add(file),
channel,
frl
);
}
Consumer:
Task.Factory.StartNew(() =>
{
Index.ResourceIndexDocument document = null;
while (this.documents.TryTake(out document, TimeSpan.FromSeconds(1)))
{
IEnumerable<Index.ResourceIndexDocument> documents = this.documents.Take(this.documents.Count);
Index.IndexEngine.Instance.index(documents);
}
},
TaskCreationOptions.LongRunning
);

Related

In C#, how do I process a large text file with multiple threads/tasks, but with conditions?

I am writing a file-processing program in C#. I have a HUGE text file, with 5 columns of data each separated by a bar(|). The first column in each row is a column containing a person's name, and each person has a unique name.
Its a very large text file, so I want to process it concurrently using multiple tasks. But I want every row with the same name to be processed by the SAME task, not a different task. For example, if (part of) my file reads:
Jason|BMW|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|Acura|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|BMW|354|23|1/1/2000|1:03
Jason|Hyundai|392|17|1/1/2000|1:06
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Dodge|335|18|8/24/2005|7:18
Mike|Infiniti|335|18|8/24/2005|7:11
Mike|Infiniti|335|18|8/24/2005|7:14
Then I want one task processing ALL the Jason rows, and another task processing ALL the Mike rows. I don't want the first task processing any Mike rows, and conversely I don't want the second task processing any Jason rows. Essentially, how can I make it so that all rows of a certain name are all processed by the SAME task? ALSO, how will I know when all the processing of all the rows has been completed? I've been racking my tiny brain and I can't come up with a solution.

One idea is to implement the producer-consumer pattern, with one producer that reads the file line-by-line, and multiple consumers that process the lines, one consumer per name. Since the number of unique names may be large, it would be impractical to dedicate a Thread for each consumer, so the consumers should process the data asynchronously. Each consumer should have its own private queue with data to process. The most efficient asynchronous queue currently available in .NET is the Channel<T> class, and using it as a building block would be a good idea, but I will suggest something higher-level that this: an ActionBlock<T> from the TPL Dataflow library. This component combines a processor and a queue, is async-enabled, and is highly configurable. So it will make for a succinct, quite readable, and hopefully quite efficient solution:
var processors = new Dictionary<string, ActionBlock<string>>();
foreach (var line in File.ReadLines(filePath))
{
string name = ExtractName(line); // Reads the first part of the line
if (!processors.TryGetValue(name, out ActionBlock<string> processor))
{
processor = CreateProcessor(name);
processors.Add(name, processor);
}
var accepted = processor.Post(line);
if (!accepted) break; // The processor has failed
}
// Signal that no more lines will be sent to the processors
foreach (var processor in processors.Values) processor.Complete();
// Aggregate the completion of all processors
Task allCompletions = Task.WhenAll(processors.Values.Select(p => p.Completion));
// Wait for the completion of all processors, and allow errors to propagate
allCompletions.Wait(); // or await allCompletions;
static ActionBlock<string> CreateProcessor(string name)
{
return new ActionBlock<string>((string line) =>
{
// Process the line
}, new ExecutionDataflowBlockOptions()
{
// Configure the options if the defaults are not optimal
});
}

I'd go for a concurrent dictionary of concurrent queues, keyed by name.
In the main thread (call it the reader), loop line by line enqueueing the lines to the appropriate concurrent queue (call these the worker queues), with creation of the a new worker queue and dedicated task as needed when a new name is encountered.
It would look something like this (note: this is semi-pseudo code and semi-real code and has no error checking, so treat it as a base for a solution, not the solution).
class FileProcessor
{
private ConcurrentDictionary<string, Worker> workers = new ConcurrentDictionary<string, Worker>();
class Worker
{
public Worker() => Task = Task.Run(Process);
private void Process()
{
foreach (var row in Queue.GetConsumingEnumerable())
{
if (row.Length == 0) break;
ProcessRow(row);
}
}
private void ProcessRow(string[] row)
{
// your implementation here
}
public Task Task { get; }
public BlockingCollection<string[]> Queue { get; } = new BlockingCollection<string[]>(new ConcurrentQueue<string[]>());
}
void ProcessFile(string fileName)
{
foreach (var line in GetLinesOfFile(fileName))
{
var row = line.Split('|');
var name = row[0];
// create worker as needed
var worker = workers.GetOrAdd(name, x => new Worker());
// add a row for the worker to work on
worker.Queue.Add(row);
}
// send an empty array to each worker to signal end of input
foreach (var worker in workers.Values)
worker.Queue.Add(new string[0]);
// now wait for all workers to be done
Task.WaitAll(workers.Values.Select(x => x.Task).ToArray());
}
private static IEnumerable<string> GetLinesOfFile(string fileName)
{
// this helps limit memory consumption by not loading
// the whole file at once
return File.ReadLines(fileName);
}
}
I suggest that your reader thread stream the file rather than reading the entire file; you stated the file was huge, so streaming would be memory friendly). That reader thread is I/O bound, so if you can async/await it, that would be better than my simple Process() doing a foreach with no awaiting.
The features of this approach:
dedicated task per person's name
use of a sentinel value to signal end of input
use of Task.WaitAll to join back to the main thread
assumes the tasks are CPU bound. If they are I/O bound, consider using async/await and Task.WhenAll instead
file is streamed into memory with File.ReadLines()
names do not need to be sorted because the queue to enqueue to is selected by name on-demand
Refinements
In the interest of completeness, the approach above is a bit naive and can be refined by... reading all of the comments and answers; users Zoulias and Mercer in particular have good points. We can refine this approach with
adapt this to TPL Channels and use CompleteAdding. These are not only better abstractions, but more efficient (abstraction and efficient can often be at odds, but not in this case).
reduce the name-to-thread or name-to-task dedication, which can exhaust resources in the case of a large number of names, and instead map names to buckets or partitions where each bucket/partition has a dedicated task/thread.
For the second point, for example, you could have
// create worker as needed
var worker = workers.GetOrAdd(GetPartitionKey(name), x => new Worker());
where GetPartitionKey() could be implemented something like
private string GetPartitionKey(string name) =>
name[0] switch
{
>= 'a' and <= 'f' => "A thru F bucket",
>= 'A' and <= 'F' => "A thru F bucket",
>= 'g' and <= 'k' => "G thru K bucket",
>= 'G' and <= 'K' => "G thru K bucket",
_ => "everything else bucket"
}
or whatever algorithm you want to use as a partition selector.

how can I make it so that all rows of a certain name are all processed by the SAME task?
A System.Threading.Task can be created using various TaskCreationOptions that dictate how and when their threads and resources are managed during their lifetime. For an operation for consuming large amount of data and furthermore segregating the consumption of data to specific threads - you may want to consider creating the tasks that are responsible for individual names with the option TaskCreationOptions.LongRunning which may provide a hint to the task scheduler that an additional thread might be required for the task so that it does not block the forward progress of other threads or work items on the local thread-pool queue.
For the actual how, I would recommend starting various 'Worker' threads, each with their own Task and a way for your main task (the one reading the file, or parsing the JSON data) to communicate between the two that more work needs to be completed.
Consider the use of thread-safe collections such as a ConcurrentQueue<T> or other various collections that may help you in streaming data between threads for consumption safely.
Here's a very limited example of the structure you may want to consider:
void Worker(ConcurrentQueue<string> Queue, CancellationToken Token)
{
// keep the worker in a
while (Token.IsCancellationRequested is false)
{
// check to see if the queue has stuff, and consume it
if (Queue.TryDequeue(out string line))
{
Console.WriteLine($"Consumed Line {line} {Thread.CurrentThread.ManagedThreadId}");
}
// yield the thread incase other threads have work to do
Thread.Sleep(10);
}
Console.WriteLine("Finished Work");
}
// data could be a reader, list, array anything really
IEnumerable<string> Data()
{
yield return "JASON";
yield return "Mike";
yield return "JASON";
yield return "Mike";
}
void Reader()
{
// create some collections to stream the data to other tasks
ConcurrentQueue<string> Jason = new();
ConcurrentQueue<string> Mike = new();
// make sure we have a way to cancel the workers if we need to
CancellationTokenSource tokenSource = new();
// start some worker tasks that will consume the data
Task[] workers = {
new Task(()=> Worker(Jason, tokenSource.Token), TaskCreationOptions.LongRunning),
new Task(()=> Worker(Mike, tokenSource.Token), TaskCreationOptions.LongRunning)
};
for (int i = 0; i < workers.Length; i++)
{
workers[i].Start();
}
// iterate the data and send it off to the queues for consumption
foreach (string line in Data())
{
switch (line)
{
case "JASON":
Console.WriteLine($"Sent line to JASON {Thread.CurrentThread.ManagedThreadId}");
Jason.Enqueue(line);
break;
case "Mike":
Console.WriteLine($"Sent line to Mike {Thread.CurrentThread.ManagedThreadId}");
Mike.Enqueue(line);
break;
default:
Console.WriteLine($"Disposed unknown line {Thread.CurrentThread.ManagedThreadId}");
break;
}
}
// make sure that worker threads are cancelled if parent task has been cancelled
try
{
// wait for workers to finish by checking collections
do
{
Thread.Sleep(10);
} while (Jason.IsEmpty is false && Mike.IsEmpty is false);
}
finally
{
// cancel the worker threads, if they havent already
tokenSource.Cancel();
}
}
// make sure we have a way to cancel the reader if we need to
CancellationTokenSource tokenSource = new();
// start the reader thread
Task[] tasks = { Task.Run(Reader, tokenSource.Token) };
Console.WriteLine("Starting Reader");
Task.WaitAll(tasks);
Console.WriteLine("Finished Reader");
// cleanup the tasks if they are still running some how
tokenSource?.Cancel();
// dispose of IDisposable Object
tokenSource?.Dispose();
Console.ReadLine();

Managing multi-threading in C# .NET, controlling thread count per operation

Think the title I've given is a bit confusing but hard to express what I'm trying to ask.
Basically I am writing a program in C# using .NET that uses the Google cloud API in order to upload data.
I am trying to do this in an efficient way and have used parallel.foreach with success but I need finer control. I collect the files to be uploaded into one list, which I want to sort by file size and then split into say 3 equally sized (in terms of gigabytes not file count) lists.
One of these lists will contain say a third in terms of total upload size but be comprised of the largest files (in gigabytes) but therefore the smallest count of files, the next list will be the same total gigabytes as the first list but be comprised of a greater number of smaller files and finally the last list will be comprised of many many small files but should also total the same size as the other sub lists.
I then want to assign a set number of threads to the upload process. (e.g. I want the largest file list to have 5 threads assigned, the middle to have 3 and the small file list to have only 2 thread.) Is it possible to set up these 3 lists to be iterated over in parallel, while controlling how many threads are allocated?
What is the best method to do so?

Parallel.ForEach and PLINQ are meant for data parallelism - processing big chunks of data using multiple cores. It's meant for scenarios where you have eg 1GB of data in memory (or a very fast IEnumerable source) and want to process it using all cores. In such scenarios, you need to partition the data into independent chunks and have one worker crunch one crunch at a time, to limit the synchronization overhead.
What you describe though is concurrent uploads for a large number of files. That's pure IO, not data parallelism. Most of the time will be spent loading the data from disk or writing it to the network. This is a job for Task.Run and async/await. To upload multiple files concurrently, you could use an ActionBlock or a Channel to queue the files and upload them asynchronously. With channels you have to write a bit of worker boilerplate but you get greater control, especially in cases where you want to use eg the same client instance for multiple calls. An ActionBlock is essentially stateless.
Finally, you describe queues with different DOP based on size, which is a very nice idea when you have both big and small files. You can do that by using multiple ActionBlock instances, each with a different DOP, or multiple Channel workers, each with a different DOP.
Dataflows
Let's say you already have a method that uploads a file by path name :
//Adopted from the Google SDK example
async Task UploadFile(DriveService service,FileInfo file)
{
var fileName=Path.GetFileName(filePath);
using var uploadStream = file.OpenRead();
var request insertRequest = service.Files.Insert(
new File { Title = file.Name },
uploadStream,
"image/jpeg");
await insert.UploadAsync();
}
You can create three different ActionBlock instances, each with a different DOP :
var small=new ActionBlock<FileInfo>(
file=>UploadFile(service,file),
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 15
});
var medium=new ActionBlock<FileInfo>(
file=>UploadFile(service,file),
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 10
});
var big=new ActionBlock<FileInfo>(
path=>UploadFile(service,file),
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 2
});
And post different files to different blocks based on size :
var directory=new DirectoryInfo(...);
var files=directory.EnumerateFiles(...);
foreach(var file in files)
{
switch (file.Length)
{
case int x when x < 1024:
small.Post(file);
break;
case int x when x < 10240:
medium.Post(file);
break;
default:
big.Post(file);
break;
}
}
Or, in C# 8 :
foreach(var file in files)
{
var block = file.Length switch {
long x when x < 1024 => small,
long x when x < 10240=> medium,
_ => big
};
block.Post(file)
}
When iteration completes, we need to tell the blocks we are done by calling Complete() on each one and waiting for all of them to finish with :
small.Complete();
medium.Complete();
big.Complete();
await Task.WhenAll(small.Completion, medium.Completion, big.Completion);

Here is another idea. You could have a single list, but upload the files with a dynamic degree of parallelism. This would be easy to implement if the SemaphoreSlim class had a WaitAsync method that could reduce the CurrentCount by a value other than 1. You could then initialize the SemaphoreSlim with a large initialCount like 1000, and then call WaitAsync with a value relative to the size of each file. Lets call this value weight. The semaphore would guarantee that the sum weight of all files currently uploaded would not exceed 1000. This could be a single huge file with weight of 1000, or 10 medium files each weighing 100, or a mix of small, medium and large files with total weight around 1000. The degree of parallelism would constantly change depending on the weight of the next file in the list.
This is an example of the code that you'd have to write:
var semaphore = new SemaphoreSlim(1000);
var tasks = Directory.GetFiles(#"D:\FilesToUpload")
.Select(async filePath =>
{
var fi = new FileInfo(filePath);
var weight = (int)Math.Min(1000, fi.Length / 1_000_000);
await semaphore.WaitAsync(weight); // Imaginary overload that accepts weight
try
{
await cloudService.UploadFile(filePath);
}
finally
{
semaphore.Release(weight);
}
})
.ToArray();
await Task.WhenAll(tasks);
Below is a custom AsyncSemaphorePlus class that provides the missing overload. It is based on Stephen Toub's AsyncSemaphore class from the blog post Building Async Coordination Primitives, Part 5: AsyncSemaphore. It is slightly modernized with features like Task.CompletedTask and TaskCreationOptions.RunContinuationsAsynchronously, that were not available at the time the blog post was written.
public class AsyncSemaphorePlus
{
private readonly object _locker = new object();
private readonly Queue<(TaskCompletionSource<bool>, int)> _queue
= new Queue<(TaskCompletionSource<bool>, int)>();
private int _currentCount;
public int CurrentCount { get { lock (_locker) return _currentCount; } }
public AsyncSemaphorePlus(int initialCount)
{
if (initialCount < 0)
throw new ArgumentOutOfRangeException(nameof(initialCount));
_currentCount = initialCount;
}
public Task WaitAsync(int count)
{
lock (_locker)
{
if (_currentCount - count >= 0)
{
_currentCount -= count;
return Task.CompletedTask;
}
else
{
var tcs = new TaskCompletionSource<bool>(
TaskCreationOptions.RunContinuationsAsynchronously);
_queue.Enqueue((tcs, count));
return tcs.Task;
}
}
}
public void Release(int count)
{
lock (_locker)
{
_currentCount += count;
while (_queue.Count > 0)
{
var (tcs, weight) = _queue.Peek();
if (weight > _currentCount) break;
(tcs, weight) = _queue.Dequeue();
_currentCount -= weight;
tcs.SetResult(true);
}
}
}
}
Update: This approach is intended for uploading a medium/large number of files. It is not suitable for extremely huge number of files, because all uploading tasks are created upfront. If the files that have to be uploaded are, say, 100,000,000, then the memory required to store the state of all these tasks may exceed the available RAM of the machine. For uploading that many files the solution proposed by Panagiotis Kanavos is probably preferable, because in can be easily modified with bounded dataflow blocks, and by feeding them with SendAsync instead of Post, so that the memory required for the whole operation is kept under control.

Handle multiple threads, one out one in, in a timed loop

I need to process a large number of files overnight, with a defined start and end time to avoid disrupting users. I've been investigating but there are so many ways of handling threading now that I'm not sure which way to go. The files come into an Exchange inbox as attachments.
My current attempt, based on some examples from here and a bit of experimentation, is:
while (DateTime.Now < dtEndTime.Value)
{
var finished = new CountdownEvent(1);
for (int i = 0; i < numThreads; i++)
{
object state = offset;
finished.AddCount();
ThreadPool.QueueUserWorkItem(delegate
{
try
{
StartProcessing(state);
}
finally
{
finished.Signal();
}
});
offset += numberOfFilesPerPoll;
}
finished.Signal();
finished.Wait();
}
It's running in a winforms app at the moment for ease, but the core processing is in a dll so I can spawn the class I need from a windows service, from a console running under a scheduler, however is easiest. I do have a Windows Service set up with a Timer object that kicks off the processing at a time set in the config file.
So my question is - in the above code, I initialise a bunch of threads (currently 10), then wait for them all to process. My ideal would be a static number of threads, where as one finishes I fire off another, and then when I get to the end time I just wait for all threads to complete.
The reason for this is that the files I'm processing are variable sizes - some might take seconds to process and some might take hours, so I don't want the whole application to wait while one thread completes if I can have it ticking along in the background.
(edit)As it stands, each thread instantiates a class and passes it an offset. The class then gets the next x emails from the inbox, starting at the offset (using the Exchange Web Services paging functionality). As each file is processed, it's moved to a separate folder. From some of the replies so far, I'm wondering if actually I should grab the e-mails in the outer loop, and spawn threads as needed.
To cloud the issue, I currently have a backlog of e-mails that I'm trying to process through. Once the backlog has been cleared, it's likely that the nightly run will have a significantly lower load.
On average there are around 1000 files to process each night.
Update
I've rewritten large chunks of my code so that I can use the Parallel.Foreach and I've come up against an issue with thread safety. The calling code now looks like this:
public bool StartProcessing()
{
FindItemsResults<Item> emails = GetEmails();
var source = new CancellationTokenSource(TimeSpan.FromHours(10));
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions { MaxDegreeOfParallelism = 8, CancellationToken = source.Token };
try
{
Parallel.ForEach(emails, opts, processAttachment);
}
catch (OperationCanceledException)
{
Console.WriteLine("Loop was cancelled.");
}
catch (Exception err)
{
WriteToLogFile(err.Message + "\r\n");
WriteToLogFile(err.StackTrace + "r\n");
}
return true;
}
So far so good (excuse temporary error handling). I have a new issue now with the fact that the properties of the "Item" object, which is an email, not being threadsafe. So for example when I start processing an e-mail, I move it to a "processing" folder so that another process can't grab it - but it turns out that several of the threads might be trying to process the same e-mail at a time. How do I guarantee that this doesn't happen? I know I need to add a lock, can I add this in the ForEach or should it be in the processAttachments method?

Use the TPL:
Parallel.ForEach( EnumerateFiles(),
new ParallelOptions { MaxDegreeOfParallelism = 10 },
file => ProcessFile( file ) );
Make EnumerateFiles stop enumerating when your end time is reached, trivially like this:
IEnumerable<string> EnumerateFiles()
{
foreach (var file in Directory.EnumerateFiles( "*.txt" ))
if (DateTime.Now < _endTime)
yield return file;
else
yield break;
}

You can use a combination of Parallel.ForEach() along with a cancellation token source which will cancel the operation after a set time:
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
static class Program
{
static Random rng = new Random();
static void Main()
{
// Simulate having a list of files.
var fileList = Enumerable.Range(1, 100000).Select(i => i.ToString());
// For demo purposes, cancel after a few seconds.
var source = new CancellationTokenSource(TimeSpan.FromSeconds(10));
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions {MaxDegreeOfParallelism = 8, CancellationToken = source .Token};
try
{
Parallel.ForEach(fileList, opts, processFile);
}
catch (OperationCanceledException)
{
Console.WriteLine("Loop was cancelled.");
}
}
static void processFile(string file)
{
Console.WriteLine("Processing file: " + file);
// Simulate taking a varying amount of time per file.
int delay;
lock (rng)
{
delay = rng.Next(200, 2000);
}
Thread.Sleep(delay);
Console.WriteLine("Processed file: " + file);
}
}
}
As an alternative to using a cancellation token, you can write a method that returns IEnumerable<string> which returns the list of filenames, and stop returning them when time is up, for example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Demo
{
static class Program
{
static Random rng = new Random();
static void Main()
{
// Process files in parallel, with a maximum thread count.
var opts = new ParallelOptions {MaxDegreeOfParallelism = 8};
Parallel.ForEach(fileList(), opts, processFile);
}
static IEnumerable<string> fileList()
{
// Simulate having a list of files.
var fileList = Enumerable.Range(1, 100000).Select(x => x.ToString()).ToArray();
// Simulate finishing after a few seconds.
DateTime endTime = DateTime.Now + TimeSpan.FromSeconds(10);
int i = 0;
while (DateTime.Now <= endTime)
yield return fileList[i++];
}
static void processFile(string file)
{
Console.WriteLine("Processing file: " + file);
// Simulate taking a varying amount of time per file.
int delay;
lock (rng)
{
delay = rng.Next(200, 2000);
}
Thread.Sleep(delay);
Console.WriteLine("Processed file: " + file);
}
}
}
Note that you don't need the try/catch with this approach.

You should consider using Microsoft's Reactive Framework. It lets you use LINQ queries to process multithreaded asynchronous processing in a very simple way.
Something like this:
var query =
from file in filesToProcess.ToObservable()
where DateTime.Now < stopTime
from result in Observable.Start(() => StartProcessing(file))
select new { file, result };
var subscription =
query.Subscribe(x =>
{
/* handle result */
});
Truly, that's all the code you need if StartProcessing is already defined.
Just NuGet "Rx-Main".
Oh, and to stop processing at any time just call subscription.Dispose().

This was a truly fascinating task, and it took me a while to get the code to a level that I was happy with it.
I ended up with a combination of the above.
The first thing worth noting is that I added the following lines to my web service call, as the operation timeout I was experiencing, and which I thought was because I'd exceeded some limit set on the endpoint, was actually due to a limit set by microsoft way back in .Net 2.0:
ServicePointManager.DefaultConnectionLimit = int.MaxValue;
ServicePointManager.Expect100Continue = false;
See here for more information:
What to set ServicePointManager.DefaultConnectionLimit to
As soon as I added those lines of code, my processing increased from 10/minute to around 100/minute.
But I still wasn't happy with the looping, and partitioning etc. My service moved onto a physical server to minimise CPU contention, and I wanted to allow the operating system to dictate how fast it ran, rather than my code throttling it.
After some research, this is what I ended up with - arguably not the most elegant code I've written, but it's extremely fast and reliable.
List<XElement> elements = new List<XElement>();
while (XMLDoc.ReadToFollowing("ElementName"))
{
using (XmlReader r = XMLDoc.ReadSubtree())
{
r.Read();
XElement node = XElement.Load(r);
//do some processing of the node here...
elements.Add(node);
}
}
//And now pass the list of elements through PLinQ to the actual web service call, allowing the OS/framework to handle the parallelism
int failCount=0; //the method call below sets this per request; we log and continue
failCount = elements.AsParallel()
.Sum(element => IntegrationClass.DoRequest(element.ToString()));
It ended up fiendishly simple and lightning fast.
I hope this helps someone else trying to do the same thing!

Advice on processing giant text file and processing URL's

I'm currently trying to loop through a text file that is about 1.5gb's in size and then use the URL's that are grabbed from it to pull down the html from the site.
For speed I'm trying to process all the HTTP request on a new thread but since C# is not my strongest language but a requirement for what I'm doing I'm a bit confused on good thread practice.
This is how I'm processing the list
private static void Main()
{
const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead("dump.txt"))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
String line;
var progress = 0;
while ((line = streamReader.ReadLine()) != null)
{
var stuff = line.Split('|');
getHTML(stuff[3]);
progress += 1;
Console.WriteLine(progress);
}
}
}
And I'm pulling down the HTML as so
private static void getHTML(String url)
{
new Thread(() =>
{
var client = new DecompressGzipResponse();
var html = client.DownloadString(url);
}).Start();
}
Though the speeds are fast doing this initially, after about 20 thousand they slow down and eventually after 32 thousand the application will hang and crash. I was under the impression C# threads terminated when the function completed?
Can anyone give any examples/ suggestions on how to do this better?

One very reliable way to do this is by using the producer-consumer pattern. You create a thread-safe queue of URLs (for example, BlockingCollection<Uri>). Your main thread is the producer, which adds items to the queue. You then have multiple consumer threads, each of which reads Urls from the queue and does the HTTP requests. See BlockingCollection.
Setting it up isn't terribly difficult:
BlockingCollection<Uri> UrlQueue = new BlockingCollection<Uri>();
// Main thread starts the consumer threads
Task t1 = Task.Factory.StartNew(() => ProcessUrls, TaskCreationOptions.LongRunning);
Task t2 = Task.Factory.StartNew(() => ProcessUrls, TaskCreationOptions.LongRunning);
// create more tasks if you think necessary.
// Now read your file
foreach (var line in File.ReadLines(inputFileName))
{
var theUri = ExtractUriFromLine(line);
UrlQueue.Add(theUri);
}
// when done adding lines to the queue, mark the queue as complete
UrlQueue.CompleteAdding();
// now wait for the tasks to complete.
t1.Wait();
t2.Wait();
// You could also use Task.WaitAll if you have an array of tasks
The individual threads process the urls with this method:
void ProcessUrls()
{
foreach (var uri in UrlQueue.GetConsumingEnumerable())
{
// code here to do a web request on that url
}
}
That's a simple and reliable way to do things, but it's not especially quick. You can do much better by using a second queue of WebCient objects that make asynchronous requests For example, say you want to have 15 asynchronous requests. You start the same way with a BlockingCollection, but you only have one persistent consumer thread.
const int MaxRequests = 15;
BlockingCollection<WebClient> Clients = new BlockingCollection<WebClient>();
// start a single consumer thread
var ProcessingThread = Task.Factory.StartNew(() => ProcessUrls, TaskCreationOptions.LongRunning);
// Create the WebClient objects and add them to the queue
for (var i = 0; i < MaxRequests; ++i)
{
var client = new WebClient();
// Add an event handler for the DownloadDataCompleted event
client.DownloadDataCompleted += DownloadDataCompletedHandler;
// And add this client to the queue
Clients.Add(client);
}
// add the code from above that reads the file and populates the queue
Your processing function is somewhat different:
void ProcessUrls()
{
foreach (var uri in UrlQueue.GetConsumingEnumerable())
{
// Wait for an available client
var client = Clients.Take();
// and make an asynchronous request
client.DownloadDataAsync(uri, client);
}
// When the queue is empty, you need to wait for all of the
// clients to complete their requests.
// You know they're all done when you dequeue all of them.
for (int i = 0; i < MaxRequests; ++i)
{
var client = Clients.Take();
client.Dispose();
}
}
Your DownloadDataCompleted event handler does something with the data that was downloaded, and then adds the WebClient instance back to the queue of clients.
void DownloadDataCompleteHandler(Object sender, DownloadDataCompletedEventArgs e)
{
// The data downloaded is in e.Result
// be sure to check the e.Error and e.Cancelled values to determine if an error occurred
// do something with the data
// And then add the client back to the queue
WebClient client = (WebClient)e.UserState;
Clients.Add(client);
}
This should keep you going with 15 concurrent requests, which is about all you can do without getting a bit more complicated. Your system can likely handle many more concurrent requests, but the way that WebClient starts asynchronous requests requires some synchronous work up front, and that overhead makes 15 about the maximum number you can handle.
You might be able to have multiple threads initiating the asynchronous requests. In that case, you could potentially have as many threads as you have processor cores. So on a quad core machine, you could have the main thread and three consumer threads. With three consumer threads this technique could give you 45 concurrent requests. I'm not certain that it scales that well, but it might be worth a try.
There are ways to have hundreds of concurrent requests, but they're quite a bit more complicated to implement.

You need thread management.
My advice is to use Tasks instead of creating your own Threads.
By using the Task Parallel Library, you let the runtime deal with the thread management. By default, it will allocate your tasks on threads from the ThreadPool, and will allow a level of concurrency which is contingent on the number of CPU cores you have. It will also reuse existing Threads when they become available instead of wasting time creating new ones.
If you want to get more advanced, you can create your own task scheduler to manage the scheduling aspect yourself.
See also What is difference between Task and Thread?

Handle Queue faster

I have a queue , it receives data overtime.
I used multi thread to dequeue and save to database.
I create an Array of Thread to do this job.
for (int i = 0; i < thr.Length; i++)
{
thr[i] = new Thread(new ThreadStart(SaveData));
thr[i].Start();
}
SaveData
Note : eQ and eiQ is 2 global queue. I used while to keep thread alive.
public void SaveData()
{
var imgDAO = new imageDAO ();
string exception = "";
try
{
while (eQ.Count > 0 && eiQ.Count > 0)
{
var newRecord = eQ.Dequeue();
var newRecordImage = eiQ.Dequeue();
imageDAO.SaveEvent(newEvent, newEventImage);
var storepath = Properties.Settings.Default.StorePath;
save.WriteFile(storepath, newEvent, newEventImage);
}
}
catch (Exception e)
{
Global._logger.Info(e.Message + e.Source);
}
}
It did create multi thread but when I debug, only 1 thread alive, the rest is dead.
I dont know why ? Any one have idea? Tks

You are using WriteFile in that thread function.
Is this possible, that you trying to write file that may be locked by another thread (same filename or something)?
And one more thing - saving data on disk by multiple threads - i dont like it.
I think you should create some buffer instead of many threads and write it every few records/entries.

As mentioned in the comments, your threads will only live as long as there are elements in the queues, as soons as both are emptied the threads will terminate. This could explain why you see only one living thread while debugging.
A potential answer to you question would be to use a BlockingCollection from the System.Collections.Concurrent classes instead of a Queue. That has the capability of doing a blocking dequeue, which will stop the thread(s) until more elements are available for processing.
Another problem is the nice race condition between eQ and eiQ -- consider using a single queue with a Tuple or custom data type so you can dequeue both newRecord and newRecordImage in a single operation.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Reading a lot of files "at the same time" - c#

Related

In C#, how do I process a large text file with multiple threads/tasks, but with conditions?

Managing multi-threading in C# .NET, controlling thread count per operation

Handle multiple threads, one out one in, in a timed loop

Advice on processing giant text file and processing URL's

Handle Queue faster

Categories

Resources