Hi, I am trying to mimic multi-threading with a Parallel.ForEach loop. Below is my function:
public void PollOnServiceStart()
{
    constants = new ConstantsUtil();
    constants.InitializeConfiguration();

    HashSet<string> newFiles = new HashSet<string>();

    //string serviceName = MetadataDbContext.GetServiceName();
    var dequeuedItems = MetadataDbContext
        .UpdateOdfsServiceEntriesForProcessingOnStart();
    var handlers = Producer.GetParserHandlers(dequeuedItems);

    while (handlers.Any())
    {
        Parallel.ForEach(handlers,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            handler =>
            {
                Logger.Info($"Started processing a file remaining in Parallel ForEach");
                handler.Execute();
                Logger.Info($"Enqueing one file for next process");
                dequeuedItems = MetadataDbContext
                    .UpdateOdfsServiceEntriesForProcessingOnPollInterval(1);
                handlers = Producer.GetParserHandlers(dequeuedItems);
            });

        int filesRemovedCount = Producer.RemoveTransferredFiles();
        Logger.Info($"{filesRemovedCount} files removed from {Constants.OUTPUT_FOLDER}");
    }
}
So, to explain what's going on: the function UpdateOdfsServiceEntriesForProcessingOnStart() gets 4 file names (4 because of the parallel count) and adds them to a thread-safe object called ParserHandler. These objects are then put into the list var handlers.
My idea here is to loop through this handler list and call the handler.Execute().
Handler.Execute() copies files from the network location onto a local drive, parses through the file and creates multiple output files, then sends said files to a network location and updates a DB table.
What I expect from this Parallel.ForEach loop is that after the handler.Execute() call, the UpdateOdfsServiceEntriesForProcessingOnPollInterval(1) function will add one new file name (from the DB table it reads) to the dequeued items container, which is then passed as a single item to the recreated handler list. In this way, after one file is done executing, a new file takes its place in each parallel loop.
However, what happens is that while I do get a new file added, it doesn't get executed by the next available thread. Instead, the Parallel.ForEach finishes executing the first 4 files, and only then picks up the very next file. Meaning, after the first 4 run in parallel, only 1 file runs at a time, which nullifies the whole point of the parallel looping. Files added before all 4 of the initial files finish the Execute() call are never executed.
I.e.:
(Start1, Start2, Start3, Start4) all at once. What should happen is something like (End2, Start5), and then (End3, Start6). But what actually happens is (End2, End3, End1, End4), then Start5, End5, Start6, End6.
Why is this happening?
Because we want to deploy multiple instances of this service app on a machine, it is not beneficial to have a giant list waiting in a queue. That is wasteful, as the other app instances won't be able to process anything.
I am writing what should be a long comment as an answer, although it's an awful answer because it doesn't answer the question.
Be aware that parallelizing filesystem operations is unlikely to make them faster, especially if the storage is a classic hard disk. The head of the disk cannot be in N places at the same moment, and if you ask it to be, it will just waste most of its time traveling instead of reading or writing.
The best way to overcome the bottleneck imposed by accessing the filesystem is to make sure that there is work for the disk to do at all moments. Don't stop the disk's work to make a computation or to fetch/save data from/to the database. To make this happen you must have multiple workflows running concurrently: one workflow does nothing but I/O with the disk, another talks continuously with the database, a third utilizes the CPU by doing one calculation after the other, etc. This approach is called task parallelism (doing heterogeneous work in parallel), as opposed to data parallelism (doing homogeneous work in parallel, the speciality of Parallel.ForEach). It is also called pipelining, because in order to make all workflows run concurrently you must place intermediate buffers between them, so you create a pipeline with the data flowing from buffer to buffer. Another term used for this kind of operation is the producer-consumer pattern, which describes a short pipeline consisting of only two building blocks, the first being the producer and the second the consumer.
The most powerful tool currently available¹ for creating pipelines is the TPL Dataflow library. It offers a variety of "blocks" (pipeline segments) that can be linked with each other and cover most scenarios. You instantiate the blocks that will compose your pipeline, configure them, tell each one what work it should do, link them together, feed the first block with the initial raw data to be processed, and finally await the Completion of the last block. You can look at an example of using the TPL Dataflow library here.
¹ Available as a built-in library in the .NET platform. Powerful third-party tools also exist, like Akka.NET for example.
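For illustration, here is a minimal sketch of such a pipeline. The input folder, the line-counting "parse" step and the console "save" step are placeholders, not the asker's actual processing:

using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

static async Task RunPipelineAsync()
{
    // Block 1: disk I/O, kept serial so the disk head isn't asked to be in N places at once.
    var load = new TransformBlock<string, string>(
        path => File.ReadAllText(path),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

    // Block 2: CPU-bound work, parallelized across cores (placeholder: count lines).
    var parse = new TransformBlock<string, int>(
        text => text.Split('\n').Length,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

    // Block 3: the "save to database" end of the pipeline (placeholder: write to console).
    var save = new ActionBlock<int>(count => Console.WriteLine($"{count} lines"));

    // Link the blocks; PropagateCompletion lets Complete() flow down the pipeline.
    var link = new DataflowLinkOptions { PropagateCompletion = true };
    load.LinkTo(parse, link);
    parse.LinkTo(save, link);

    // Feed the first block with the raw input, then await the last block.
    foreach (var path in Directory.EnumerateFiles(@"C:\input"))
        await load.SendAsync(path);
    load.Complete();
    await save.Completion;
}

Each block has its own internal buffer, so the disk, the CPU and the database can all stay busy at the same time.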
Related
I have an Azure WebJob that loops through the pages of a file and processes them. The job also has an ICollector to an output queue:
[Queue("batch-pages-to-process")] ICollector<QueueMessageBatchPage> outputQueueMessage
I need to wait until all of the pages are processed before I send everything to the output queue, so instead of adding each message to the ICollector in my file processing loop, I add the messages to a list of queue messages:
List<QueueMessageBatchPage>
After all of the pages have been dealt with, I then loop through the list and add the messages to the ICollector:
foreach (var m in outputMessages)
{
    outputQueueMessage.Add(m);
}
But this last part seems to take a long time. To add 300 queue messages, it takes almost 50 seconds. I don't have much to gauge by, but that seems slow. Is this normal?
There's no objective standard of slow vs. fast to offer you, but a few thoughts:
a) Part of the queuing time will be serialization of each QueueMessageBatchPage instance... the performance of that will be inversely related to the breadth and depth of the object graphs those instances represent. More data obviously takes more time to write to the queue.
b) I know you mentioned that you can't write to the queue until all file lines have been processed, but if at all possible you might reconsider that choice. To the extent you could parallelize both the processing of lines in the file and subsequent writing to the output queue (using either multiple WebJob instances or perhaps TPL Tasks within a single WebJob instance), you could potentially get this work done a lot faster. Again, I realize you stated upfront that you can't do that, so I'm just suggesting you consider the full implications of that choice (if you haven't already).
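For illustration only, here is a sketch of what (b) might look like inside a single WebJob using TPL Tasks. It assumes the outputMessages list and outputQueueMessage collector from the question, and you would need to verify that ICollector<T>.Add is safe to call concurrently in your WebJobs SDK version before relying on this:

// Hedged sketch: fan the Add calls out across TPL tasks.
// Verify first that ICollector<T>.Add tolerates concurrent calls.
var addTasks = outputMessages
    .Select(m => Task.Run(() => outputQueueMessage.Add(m)))
    .ToArray();
Task.WaitAll(addTasks);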
c) One other possibility to look at... make sure the region where your storage queue lives is the same as where your WebJob lives, to minimize latency.
Best of luck!
This is my first post here, so apologies if this isn't structured well.
We have been tasked to design a tool that will:
Read a file (of account IDs), CSV format
Download the account data file from the web for each account (by Id) (REST API)
Pass the file to a converter that will produce a report (financial predictions etc) [~20ms]
If the prediction threshold is within limits, run a parser to analyse the data [400ms]
Generate a report for the analysis above [80ms]
Upload all files generated to the web (REST API)
Now all those individual points are relatively easy to do. I'm interested in finding out how best to architect something to handle this and to do it fast & efficiently on our hardware.
We have to process roughly 2 million accounts. The square brackets give an idea of how long each process takes on average. I'd like to use the maximum resources available on the machine - 24-core Xeon processors. It's not a memory-intensive process.
Would using TPL and creating each of these as a task be a good idea? Each has to happen sequentially but many can be done at once. Unfortunately the parsers are not multi-threading aware and we don't have the source (it's essentially a black box for us).
My thoughts were something like this - assuming we're using TPL (a rough code sketch follows the list):
Load account data (essentially a CSV import or SQL SELECT)
For each Account (Id):
Download the data file for each account
ContinueWith using the data file, send to the converter
ContinueWith check threshold, send to parser
ContinueWith Generate Report
ContinueWith Upload outputs
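In code, that chain might look roughly like this; Download, RunConverter, RunParser, GenerateReport and Upload are hypothetical stand-ins for the steps above, not real APIs:

// Rough sketch only; each helper below is a placeholder for one of the steps.
var pipelines = accountIds.Select(id =>
    Task.Run(() => Download(id))                      // REST download
        .ContinueWith(t => RunConverter(t.Result))    // converter [~20ms]
        .ContinueWith(t => RunParser(t.Result))       // parser, if within threshold [400ms]
        .ContinueWith(t => GenerateReport(t.Result))  // report [80ms]
        .ContinueWith(t => Upload(t.Result)))         // REST upload
    .ToArray();
Task.WaitAll(pipelines);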
Does that sound feasible or am I not understanding it correctly? Would it be better to break down the steps a different way?
I'm a bit unsure how to handle issues with the parser throwing exceptions (it's very picky) or failures when uploading.
All this is going to be in a scheduled job that will run after-hours as a console application.
I would think about using some kind of message bus. That way you can separate the steps, and if one doesn't work (for example because the REST service isn't accessible for some time) you can store the messages and process them later.
Depending on what you use as a message bus, you can introduce threads with it.
In my opinion you could better design workflows, handle exceptional states and so on if you have a higher-level abstraction like a service bus.
Also, because the parts can run independently, they don't block each other.
One easy way could be to use ServiceStack messaging with the Redis ServiceBus (a rough sketch follows the quoted list below).
Some advantages quoted from there:
Message-based design allows for easier parallelization and introspection of computations
DLQ messages can be introspected, fixed and later replayed after server updates and rejoin normal message workflow
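As a hedged sketch of what that could look like with ServiceStack's Redis messaging (API names recalled from the ServiceStack docs, so verify them against your version; AccountToProcess and ProcessAccount are hypothetical):

using ServiceStack.Messaging.Redis;
using ServiceStack.Redis;

// One handler per workflow step; failed messages land in a DLQ for later replay.
var redisFactory = new PooledRedisClientManager("localhost:6379");
var mqServer = new RedisMqServer(redisFactory, retryCount: 2);

mqServer.RegisterHandler<AccountToProcess>(m =>
{
    ProcessAccount(m.GetBody()); // hypothetical per-account work
    return null;                 // no reply message
});

mqServer.Start();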
I think the easy way to start with multiple threads in your case will be putting the entire operation for each account id in a thread (or better, in the ThreadPool). With the approach proposed below, I think you will not need to control inter-thread operations.
Something like this to put the data on the thread pool queue:
var accountIds = new List<int>(); // populated from the CSV import / SQL SELECT

foreach (var accountId in accountIds)
{
    ThreadPool.QueueUserWorkItem(ProcessAccount, accountId);
}
And this is the function that will process each account:
public static void ProcessAccount(object accountId)
{
    // Download the data file for this account
    // ContinueWith using the data file, send to the converter
    // ContinueWith check threshold, send to parser
    // ContinueWith Generate Report
    // ContinueWith Upload outputs
}
I am writing a heavy web scraper in C#. I want it to be fast and reliable.
Parallel.ForEach and Parallel.For are way too slow for this.
For the input I am using a list of URLs. I want to have up to 300 threads working at the exact same time (my CPU and net connection can handle this). What would be the best way to do this? Would using tasks work better for this?
Sometimes the threads end for no apparent reason and some of the results don't get saved. I want a more reliable way of doing this. Any ideas?
I want to have a more solid queue type of scraping.
What I came up with (not all code but the important parts):
List<string> words = new List<string>(); // read from the input text file
int total = words.Count;
int maxThreads = 300;

while (true)
{
    if (activeThreads < maxThreads)
    {
        current++;
        Thread thread = new Thread(() => CrawlWebsite(words[current]));
        thread.Start();
    }
}
public static void CrawlWebsite(string word)
{
    activeThreads++;
    // scraping part
    activeThreads--;
}
Consider using System.Threading.ThreadPool. It could be a little faster for your scenario with many threads, and you don't need to manage activeThreads yourself. Instead you can use ThreadPool.SetMaxThreads() and SetMinThreads(), and the ThreadPool manages the number of parallel threads for you.
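A small sketch of capping the pool for this scenario (the 300 matches the question's maxThreads; the min-thread values are arbitrary examples):

// Cap worker threads; leave completion-port threads at their current maximum.
ThreadPool.GetMaxThreads(out _, out int maxIo);
ThreadPool.SetMaxThreads(300, maxIo);
ThreadPool.SetMinThreads(50, 10); // pre-warm some threads so ramp-up is faster

foreach (var word in words)
    ThreadPool.QueueUserWorkItem(_ => CrawlWebsite(word));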
BTW, synchronization of the shared variables is missing in your example. One way to synchronize access is using "lock" - see http://msdn.microsoft.com/en-us/library/c5kehkcz.aspx
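For a simple counter like activeThreads, Interlocked is a minimal alternative to lock; a sketch applied to the question's method:

public static void CrawlWebsite(string word)
{
    Interlocked.Increment(ref activeThreads); // atomic, no lock object needed
    try
    {
        // scraping part
    }
    finally
    {
        Interlocked.Decrement(ref activeThreads); // runs even if scraping throws
    }
}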
Also, your thread-run method CrawlWebsite() should handle ThreadAbortException - see http://msdn.microsoft.com/en-us/library/system.threading.threadabortexception.aspx.
I was recently working on a very similar problem, and I don't think that using any high number of threads will make it faster. The slowest thing is usually downloading the data. A huge number of threads does not make that faster, because mostly they are waiting for network connections, data transfer etc. So I ended up with two queues. One is handled by a small number of threads that just send async download requests (10-15 requests at a time). The responses are stored in another queue that goes into another thread pool that takes care of parsing and data processing (the number of threads here depends on your CPU and processing algorithm).
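A minimal sketch of that two-queue layout; the Parse placeholder, the concurrency limits and the error handling are all illustrative:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class TwoQueueScraper
{
    static readonly BlockingCollection<string> Urls = new BlockingCollection<string>();
    static readonly BlockingCollection<string> Pages = new BlockingCollection<string>();

    static async Task RunAsync()
    {
        // The producer elsewhere calls Urls.Add(url) and finally Urls.CompleteAdding().
        var http = new HttpClient();
        var throttle = new SemaphoreSlim(10); // ~10 concurrent requests, as described above

        // Downloader: a single loop issuing async requests, so no thread sits blocked on I/O.
        var downloader = Task.Run(async () =>
        {
            var inflight = new List<Task>();
            foreach (var url in Urls.GetConsumingEnumerable())
            {
                await throttle.WaitAsync();
                inflight.Add(Task.Run(async () =>
                {
                    try { Pages.Add(await http.GetStringAsync(url)); }
                    catch (HttpRequestException) { /* log and maybe retry */ }
                    finally { throttle.Release(); }
                }));
            }
            await Task.WhenAll(inflight);
            Pages.CompleteAdding(); // tell the parsers no more pages are coming
        });

        // Parsers: one CPU-bound worker per core, consuming the second queue.
        var parsers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var html in Pages.GetConsumingEnumerable())
                    Parse(html);
            }))
            .ToArray();

        await downloader;
        await Task.WhenAll(parsers);
    }

    static void Parse(string html) { /* parsing and data processing go here */ }
}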
I also save all downloaded data to a database. Any time I want to parse some new information from the web, I don't need to re-download the content, but only parse the cached pages from the DB (this saves a lot of time).
I have a simple web application module which basically accepts requests to save a zip file on PageLoad from a mobile client app.
Now, what I want to do is unzip the file, read the file inside it, and process it further, including making entries into a database.
Update: the zip file and its contents will be fairly small in size, so the server shouldn't be burdened with much load.
Update 2: I just read about when IIS queues requests (at global/app level). So does that mean that I don't need to implement complex request handling mechanism and the IIS can take care of the app by itself?
Update 3: I am looking to offload the processing of the downloaded zip, not only to minimize the overhead (in terms of performance) but also to avoid table-locking problems when the file is processed and records are updated in the same table. With multiple devices requesting the page and the background task updating the database in parallel, this would cause an exception.
As of now I have zeroed on two solutions:
To implement a concurrent/message queue
To implement the file processing code into a separate tool and schedule a job on the server to check for non-processed file(s) and process them serially.
I am inclined towards the queuing mechanism and will try to implement it, as it seems less dependent on configuration vs. manually configuring the job/schedule on the server side.
So, what do you guys recommend me for this purpose?
Moreover, after the zip file is requested and saved on the server side, the client/server connection is released. I am not looking to burden my IIS.
Imagine a couple of hundred clients simultaneously requesting the page..
I actually haven't used either of them before, so any samples or how-tos would be much appreciated.
I'd recommend TPL and Rx Extensions: make your unzipped file list an observable collection and for each item start a new task asynchronously.
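A minimal sketch of that suggestion, assuming the System.Reactive package; extractedFolder and ProcessEntry are hypothetical placeholders for the extraction folder and the per-file work:

using System;
using System.IO;
using System.Reactive.Linq;
using System.Threading.Tasks;

// Turn the unzipped file list into an observable and process each entry on a task.
Directory.EnumerateFiles(extractedFolder) // the folder the zip was extracted to
    .ToObservable()
    .SelectMany(path => Observable.FromAsync(() => Task.Run(() => ProcessEntry(path))))
    .Subscribe(
        _ => { },                           // per-item completion
        ex => Console.Error.WriteLine(ex)); // surface processing errors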
I'd suggest a queue system.
When you receive a file, save its path into a thread-synchronized queue. Meanwhile a background worker (or preferably another machine) checks this queue for new files and dequeues the entries to handle them.
This way you won't launch an unknown number of threads (one per zip file) and can handle the zip files in one location. It also becomes easier to move your zip-handling code to another machine when the load gets too heavy: you just need to access a common queue.
The easiest would probably be to use a static Queue with a lock object. It is the easiest to implement and does not require external resources. But this will result in the queue being lost when your application recycles.
You mentioned that losing zip files is not an option; in that case this approach is not the best if you don't want to rely on external resources. Depending on your load it may be worth utilizing external resources - meaning upload the zip file to common storage on another machine and add a message to a queue on another machine.
Here's an example with a local queue:
ConcurrentQueue<string> queue = new ConcurrentQueue<string>();

void GotNewZip(string pathToZip)
{
    queue.Enqueue(pathToZip); // Add a new work item to the queue
}

void MethodCalledByWorker()
{
    while (true)
    {
        if (queue.IsEmpty)
        {
            // Supposedly no work to be done; wait a few seconds and check again (new iteration)
            Thread.Sleep(TimeSpan.FromSeconds(5));
            continue;
        }

        string pathToZip;
        if (queue.TryDequeue(out pathToZip)) // If TryDequeue returns false, another thread dequeued the last element already
        {
            HandleZipFile(pathToZip);
        }
    }
}
This is a very rough example. Whenever a zip arrives, you add the path to the queue. Meanwhile a background worker (or multiple; the example is thread-safe) handles one zip after another, getting the paths from the queue. The zip files will be handled in the order they arrive.
You need to make sure that your application does not recycle in the meantime. But that's the case with all resources you keep on the local machine: they'll be lost when your machine crashes.
I believe you are optimising prematurely.
You mentioned table-locking - what kind of DB are you using? If you add new rows or update existing ones, most modern databases in most configurations will:
use row-level locking; and
be fast enough without you needing to worry about locking.
I suggest starting with a simple method:
//Unzip
//Do work
//Save results to database
and getting some proof that it's actually too slow.
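A minimal concrete version of that method, for illustration; HandleUpload, DoWork and SaveResults are hypothetical names, and ZipFile comes from System.IO.Compression:

public void HandleUpload(string zipPath, string workDir)
{
    System.IO.Compression.ZipFile.ExtractToDirectory(zipPath, workDir); // Unzip
    var results = DoWork(workDir);                                      // Do work
    SaveResults(results);                                               // Save results to database
}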
The Problem
I have a constant flow of data packages. Every time a new package comes in (at a 100 ms interval) an event fires, in which I send the package to a class that processes the data and visualizes it. Unfortunately, the processing of a package can be aborted if a new package is sent before the current one has been processed.
What I have now
A code base that is used as a dll by another program.
When I started coding the dll I was new to C#, so I didn't want to make it too complicated. Everything is working, but I faced some ugly frame skips (in my visualization part) when the CPU is very busy.
I have several classes, and one of them handles all packages. This class has roughly 50 attributes, 25 functions and 1000 lines of code; only 6 functions are needed for the calculations.
The rest is setting the attributes correctly (if the user changes settings).
What I need to change
So now I want to buffer all incoming data using a List.
The list should be handled by another thread, so it is very unlikely that writing the data to the list takes longer than 100 ms ^^ (2 arrays with ca. 40 elements each, which should amount to almost nothing).
What I have in mind
Splitting the mentioned class into 2 separate classes:
one that handles the packages and one that handles the settings. That way I would separate the user input from the "program" input.
Creating a thread that uses the "package handling" class to process the buffered data.
What I have no clue about
Since the settings class contains important attributes that are needed by the handling class, I don't know how best to do this, because the handling class also needs to change/fill buffers in the settings class, but that class will be invoked by the main thread. Or is it better not to split the settings and handling classes and leave it as it is? I am not very familiar with threading; I've read the first chapter of the free e-book Threading in C#.
I would just add a queue and implement a thread for the processing. That will help you with the skips and requires little change. Refactoring the code by splitting settings apart seems like a lot of work with little benefit and possible new bugs.
To add the queue:
Create a ConcurrentQueue (this is a thread-safe FIFO queue, which is what you need):
var cq = new ConcurrentQueue<Packets>();
Add all your data packets to that queue:
cq.Enqueue(newPacket);
Create another thread that loops around and processes the queue:
if (cq.TryDequeue(out newPacket))
{
    // Visualize new packet
}
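A slightly fuller sketch of that consumer thread; Packets and Visualize stand in for the types in your dll:

var worker = new Thread(() =>
{
    while (true)
    {
        if (cq.TryDequeue(out Packets newPacket))
            Visualize(newPacket); // process/visualize off the main thread
        else
            Thread.Sleep(10);     // queue empty; don't spin a core
    }
});
worker.IsBackground = true; // don't keep the host process alive on shutdown
worker.Start();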