I have a MT app that downloads content from the internet (e.g. lots of images, 10 KB to 5 MB). One download session can represent gigabytes of data. I have wrapped the download in a Parallel.ForEach loop and that works, but it doesn't seem to use any more than one thread on the device for downloading (I would like at least two to reduce the download time).
Note: Parallel.ForEach does create multiple threads in the simulator. Should I just throw all the downloads as tasks into the thread pool? Should I spin up my own queue and threads and bypass the threadpool? I know the threadpool scales to match the device, so that might not be the best option.
When it comes to IO, only the application developer knows how much parallelism he wants. Don't rely on the TPL for that - it knows nothing about IO.
Create the right amount of IO parallelism yourself by starting the correct number of tasks manually, using PLINQ with an exact degree of parallelism or using async IO (which is thread-less).
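One way to pick the degree of IO parallelism yourself is a SemaphoreSlim gate around the async work. This is a sketch under assumptions: the item count, the cap of 4, and the Task.Delay standing in for a real async download are all illustrative, not part of the original answer.

```csharp
// Sketch: cap concurrent "downloads" explicitly instead of letting the
// TPL guess. Task.Delay is a placeholder for the real async HTTP call.
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class IoParallelismSketch
{
    // Runs `items` fake downloads with at most `cap` in flight;
    // returns the peak concurrency actually observed.
    public static async Task<int> RunAsync(int items, int cap)
    {
        var gate = new SemaphoreSlim(cap);
        int current = 0, peak = 0;

        var tasks = Enumerable.Range(0, items).Select(async _ =>
        {
            await gate.WaitAsync();
            try
            {
                int now = Interlocked.Increment(ref current);
                InterlockedMax(ref peak, now);
                await Task.Delay(20); // stands in for the real async IO
            }
            finally
            {
                Interlocked.Decrement(ref current);
                gate.Release();
            }
        }).ToArray();

        await Task.WhenAll(tasks);
        return peak;
    }

    static void InterlockedMax(ref int target, int value)
    {
        int seen;
        do { seen = target; }
        while (value > seen &&
               Interlocked.CompareExchange(ref target, value, seen) != seen);
    }

    public static void Main()
    {
        int peak = RunAsync(items: 20, cap: 4).Result;
        Console.WriteLine("peak concurrent downloads: " + peak);
    }
}
```

For CPU-bound loops, PLINQ's WithDegreeOfParallelism achieves the same cap; for downloads the semaphore has the advantage that no threads sit blocked while IO is in flight.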
Are you downloading via HTTP? I've found the WebClient class to work well for the type of thing you're describing.
Something like:
WebClient client = new WebClient();
client.DownloadFileCompleted += new AsyncCompletedEventHandler(client_DownloadFileCompleted);
client.DownloadFileAsync(new Uri("http://stackoverflow.com"), "test.txt");

void client_DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
    // file finished downloading
}
This way there's no need to manage the threads yourself.
Also, if you want to read the data right away, you might just use DownloadDataAsync and save the file yourself.
I am working on a project where I am to extract information continually from multiple servers (fewer than 1000) and write most of the information into a database. I've narrowed down my choices to 2:
Edit: This is a client, so I will be generating the connections and requesting information periodically.
1 - Using the asynchronous approach, create N sockets to poll, decide whether the information will be written into the database on the callback and put the useful information into a buffer. Then write the information from the buffer using a timer.
2 - Using the multithreading approach, create N threads with one socket per thread. The buffer of the useful information would remain on the main thread and so would the cyclic writing.
Both options in fact use multiple threads; the second just seems to add the extra difficulty of creating each thread manually. Are there any merits to it? Is writing by using a timer wise?
With 1000 connections async IO is usually a good idea because it does not block threads while the IO is in progress. (It does not even use a background thread to wait.) That makes (1) the better alternative.
It is not clear from the question what you would need a timer for. Maybe for buffering writes? That would be valid but it seems to have nothing to do with the question.
Polling has no place in a modern async IO application. The system calls your callback (or completes your IO Task) when it is done. The callback is queued to the thread-pool. This allows you to not worry about that. It just happens.
The code that reads data should look like this:
while (true)
{
    var msg = await ReadMessageAsync(socket);
    if (msg == null) break;
    await WriteDataAsync(msg);
}
Very simple. No blocking of threads. No callbacks.
In answer to the "is using a timer wise" question, perhaps it is better to make your buffer auto-flush when it reaches either a certain age or a certain size. This is the way the in-memory cache works in the .NET Framework: the cache is set to both a maximum size and a maximum staleness.
Resiliency on failure might be a concern, as well as the possibility that peak loads could blow your buffer if it's an in-memory one. You might consider making your buffer local but persistent, for instance using MSMQ or a similar high-speed queue technology. I've seen this done successfully; especially if you make the buffer write async (i.e. "fire and forget"), it has almost no impact on the ability to service the input queue, and it allows the database population code to pull from the persistent buffer(s) whenever it needs to or whenever prompted.
Another option is to have a dedicated thread whose only job is to service the buffer and write data to the database as fast as it can. So when you make a connection and get data, that data is placed in the buffer. But you have one thread that's always looking at the buffer and writing data to the database as it comes in from the other connections.
Create the buffer as a BlockingCollection<T>. Use asynchronous requests as suggested in a previous answer. And have a single dedicated thread that reads the data and writes it to the database:
BlockingCollection<DataType> _theQueue = new BlockingCollection<DataType>(MaxBufferSize);
// add data with
_theQueue.Add(Dataitem);
// service the queue with a simple loop
foreach (var dataItem in _theQueue.GetConsumingEnumerable())
{
    // write dataItem to the database
}
When you want to shut down (i.e. no more data is being read from the servers), you mark the queue as complete for adding. The consumer thread will then empty the queue, note that it's marked as complete for adding, and the loop will exit.
// mark the queue as complete for adding
_theQueue.CompleteAdding();
You need to make the buffer large enough to handle bursts of information.
If writing one record at a time to the database isn't fast enough, you can modify the consumer loop to fill its own internal buffer with some number of records (10? 100? 1000?), and write them to the database all in one shot. How you do that will depend of course on your server. But you should be able to come up with some form of bulk insert that will reduce the number of round trips you make to the database.
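The batched consumer loop described above can be sketched as follows. This is illustrative only: BatchSize, the int item type, and collecting batches into a list (standing in for a bulk insert) are assumptions, not part of the original answer.

```csharp
// Sketch: drain the BlockingCollection in batches so the database writer
// can do one bulk insert per batch instead of one round trip per record.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchingConsumer
{
    const int BatchSize = 100; // tune to your server

    public static List<List<int>> Consume(BlockingCollection<int> queue)
    {
        var batches = new List<List<int>>();
        var batch = new List<int>(BatchSize);
        foreach (var item in queue.GetConsumingEnumerable())
        {
            batch.Add(item);
            if (batch.Count >= BatchSize)
            {
                batches.Add(batch);            // stand-in for a bulk insert
                batch = new List<int>(BatchSize);
            }
        }
        if (batch.Count > 0) batches.Add(batch); // flush the tail on shutdown
        return batches;
    }

    public static void Main()
    {
        var queue = new BlockingCollection<int>();
        var consumer = Task.Run(() => Consume(queue));
        for (int i = 0; i < 250; i++) queue.Add(i);
        queue.CompleteAdding();                  // same shutdown signal as above
        Console.WriteLine(consumer.Result.Count + " batches"); // prints "3 batches"
    }
}
```

In production you would likely also flush a partial batch after a timeout so records don't sit in the buffer during quiet periods.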
For option (1) you could write qualifying information to a queue and then listen on the queue with your database writer. This will allow your database some breathing space during peak loads and avoid the requests backing up waiting for a timer.
A persistent queue would give you some resilience too.
I am using WebClient.DownloadFileTaskAsync() to download files from a website concurrently. For each file, the first x number of bytes downloads at my connection's maximum speed but then slows to a painful 32kbps, causing anything larger than a couple megabytes to take forever to complete. It makes no difference if I'm downloading 1 file or 50.
Is there any way to get around this and have the whole file download without slowing down?
using (WebClient wc = new WebClient())
{
    wc.Proxy = null;
    wc.DownloadProgressChanged += (s, e) =>
        Application.Current.Dispatcher.BeginInvoke((Action)(() =>
        {
            track.Progress = e.ProgressPercentage;
            track.TotalBytes = e.TotalBytesToReceive;
            track.BytesReceived = e.BytesReceived;
        }));

    await wc.DownloadFileTaskAsync(
        new Uri(track.FilePath),
        string.Format(
            "{0}/{1} {2}.mp3",
            directory,
            track.Number,
            track.TitlePath));
}
Update: The files exhibit the same behavior when loaded into a browser so it would seem that this problem isn't local to my application. If anybody has any ideas as to what might be causing this, please let me know.
Update: This seems to be an issue with the website I'm using. Downloads from other websites go at full speed and I tried running the program while connected to a VPN with the same results. All of WebClient's data-grabbing methods behave the same, including OpenRead. Are there any tricks that I could try that might prevent the speed from dropping?
As a point of interest, events on asynchronous operations are generally marshaled back to the thread on which the operation was invoked (via the SynchronizationContext captured at the time of the call). So, if you invoked DownloadFileTaskAsync on the main GUI thread (or the dispatcher thread, depending on your point of view), then DownloadProgressChanged will be invoked on the main GUI thread and you don't need to use BeginInvoke.
I don't think that's your problem, but one thing WebClient does is use thread pool threads to raise its events. Unfortunately, this can put stress on the thread pool, which can end up throttling the download.
If throughput during the download isn't completely vital, I'd suggest using HttpClient instead.
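A minimal sketch of that HttpClient route, streaming the response body straight to a file. The URL and destination path are placeholders; the single shared client instance is the usual recommendation but an assumption here:

```csharp
// Sketch: download with HttpClient, streaming to disk rather than
// buffering the whole body; no event-based callbacks needed.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class HttpClientDownload
{
    static readonly HttpClient Client = new HttpClient(); // reuse one instance

    public static async Task DownloadToFileAsync(string url, string path)
    {
        // ResponseHeadersRead starts streaming as soon as headers arrive
        using (var response = await Client.GetAsync(
            url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            using (var body = await response.Content.ReadAsStreamAsync())
            using (var file = File.Create(path))
            {
                await body.CopyToAsync(file);
            }
        }
    }
}
```

Progress reporting isn't built in as it is with WebClient, but you can read Content-Length from the response headers and count bytes in your own copy loop if you need it.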
I just finished up my most complex and feature-laden WinForms application to date. It loads a list of any number of HTML files, then loads the content of one, uses some RegEx to match some tags and remove or replace them (yes, yes, I've seen this. It works just fine, thanks Cthulu), then writes it to disk.
However, I noticed that ~200 files takes roughly 30 seconds to process, and after the first 5-10 seconds the program is reported as "Not Responding". I'm assuming it's not wise to do something like this guy did, as the hard drive is a bottleneck.
Perhaps it'd be possible to load as many as possible into memory, then process each one with a thread, write those, then load some more into memory?
At the very least, would creating a worker thread separate from the UI thread prevent the "Not Responding" issue? (This MSDN article covers what I was considering.)
I guess I'm asking if multithreading will offer any sort of speed improvement, and if so, what would be the best way of going about it?
Any help or advice is much appreciated!
Yes, you should start by using a BackgroundWorker to decouple your work from the GUI. Handling a GUI event should never take much time; aim for 20ms, not 20s.
Then as a bonus you could see if the processing (CPU intensive part) can be split into independent jobs and execute them as TPL Tasks.
There is insufficient information to say if or how you should do that.
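The BackgroundWorker approach might look like the sketch below. The file list, the commented-out ProcessOneFile call, and the progress math are illustrative assumptions standing in for the RegEx cleanup and disk write:

```csharp
// Sketch: run the per-file processing loop off the GUI thread with a
// BackgroundWorker, reporting progress back so the UI stays responsive.
using System;
using System.ComponentModel;
using System.Threading;

public static class FileProcessorSketch
{
    public static BackgroundWorker CreateWorker(string[] files)
    {
        var worker = new BackgroundWorker { WorkerReportsProgress = true };
        worker.DoWork += (s, e) =>
        {
            for (int i = 0; i < files.Length; i++)
            {
                // ProcessOneFile(files[i]);   // RegEx cleanup + write to disk
                worker.ReportProgress(100 * (i + 1) / files.Length);
            }
        };
        // ProgressChanged and RunWorkerCompleted are marshaled back to the
        // thread that called RunWorkerAsync, so updating controls is safe.
        worker.ProgressChanged += (s, e) => { /* update a progress bar */ };
        return worker;
    }

    public static void Main()
    {
        var done = new ManualResetEvent(false);
        var worker = CreateWorker(new[] { "a.html", "b.html" });
        worker.RunWorkerCompleted += (s, e) => done.Set();
        worker.RunWorkerAsync();
        done.WaitOne(5000);
        Console.WriteLine("processing finished");
    }
}
```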
Threading jobs, tasks, etc. will, in most cases, prevent the primary, or main thread from becoming non-responsive. Do not create multiple threads for disk IO (obviously). I would dedicate a single worker thread to taking your files off a queue and processing the disk IO. Otherwise, 1 or 2 worker threads to do in-memory processing should be sufficient while your main thread can remain responsive.
First of all, if you want the program to remain responsive move the calculations to a separate thread (remove it from the UI thread).
The actual performance improvement depends on the number of processors you have, not the number of threads.
So if you have P processors, you can divide the work into P work items and get some speedup (Amdahl's Law).
You can use a BackgroundWorker to divide the work properly: C# BackgroundWorker Tutorial
Why not use File.ReadAllLines() to read each file into an array, and then process each element of the array?
If you do all your processing in the GUI thread, your application will show "Not Responding" if it takes very long. In my opinion, you should never do (extensive) processing in the same thread as your GUI.
In addition, you could even create a thread for each file to be processed. This will most likely speed things up, as long as the separate threads do not need any data from each other.
I want to implement a Windows service that picks up flat delimited files dropped into a folder and imports them into the database. What I originally envisioned was a FileSystemWatcher watching for new files and creating a new thread for each import.
I wanted to know how I should properly implement an algorithm for this and what technique I should use. Am I going in the right direction?
I developed a product like this for a customer. The service monitored a number of folders for new files, and when files were discovered they were read, processed (printed on barcode printers), archived, and deleted.
We used a "discoverer" layer that discovered files using FileSystemWatcher or polling depending on the environment (since FileSystemWatcher is not reliable when monitoring e.g. Samba shares), a "file reader" layer, and a "processor" layer.
The "discoverer" layer discovered files and put the filenames in a list that the "file reader" layer processed. The "discoverer" layer signaled that there were new files to process by setting an event that the "file reader" layer was waiting on.
The "file reader" layer then read the files (using retry functionality, since you may get notifications for new files before they have been completely written by the process that creates them).
After the "file reader" layer had read a file, a new "processor" work item was queued using ThreadPool.QueueUserWorkItem to process the file contents.
When a file had been processed, the original was copied to an archive and deleted from the original location. The archive was also cleaned up regularly to keep it from flooding the server, and it was great for troubleshooting.
This has now been used in production in a number of different environments for over two years and has proved to be very reliable.
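The "retry functionality" mentioned above can be sketched like this: keep trying to open the file exclusively until the producer has finished writing it. The attempt count and delay are illustrative assumptions:

```csharp
// Sketch: retry opening a file that another process may still be writing.
// FileShare.None makes the open fail while the producer holds the file,
// which is exactly the signal we want to wait on.
using System;
using System.IO;
using System.Threading;

public static class RetryingReader
{
    public static string ReadAllTextWithRetry(string path, int attempts = 10)
    {
        for (int i = 0; ; i++)
        {
            try
            {
                using (var stream = new FileStream(
                    path, FileMode.Open, FileAccess.Read, FileShare.None))
                using (var reader = new StreamReader(stream))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (IOException) when (i < attempts - 1)
            {
                Thread.Sleep(100); // writer not done yet; wait and retry
            }
        }
    }
}
```

After `attempts` failures the final IOException propagates, so a stuck file surfaces as an error rather than an infinite loop.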
I've fielded a service that does this as well. I poll via a timer whose elapsed event handler acts as a supervisor, adding new files to a queue and launching a configurable number of threads that consume the queue. Once the files are processed, it restarts the timer.
Each thread including the event handler traps and reports all exceptions. The service is always running, and I use a separate UI app to tell the service to start and stop the timer. This approach has been rock solid and the service has never crashed in several years of processing.
The traditional approach is to create a finite set of threads (could be as few as 1) and have them watch a blocking queue. The code in the FileSystemWatcher[1] event handlers will enqueue work items while the worker thread(s) dequeue and process them. It might look like the following, which uses the BlockingCollection class available in .NET 4.0 or as part of the Reactive Extensions download.
Note: The code is left short and concise for brevity. You will have to expand and harden it yourself.
public class Example
{
    private BlockingCollection<string> m_Queue = new BlockingCollection<string>();

    public Example()
    {
        var thread = new Thread(Process);
        thread.IsBackground = true;
        thread.Start();
    }

    private void FileSystemWatcher_Event(object sender, EventArgs args)
    {
        string file = GetFilePathFromEventArgs(args);
        m_Queue.Add(file);
    }

    private void Process()
    {
        while (true)
        {
            string file = m_Queue.Take();
            // Process the file here.
        }
    }
}
You could take advantage of the Task class in the TPL for a more modern and ThreadPool-like approach. You would start a new task for each file (or perhaps batch them) that needs to be processed. The only gotcha I see with this approach is that it would be harder to control the number of database connections being opened simultaneously. It's definitely not a showstopper and it might be of no concern.
[1] The FileSystemWatcher has been known to be a little flaky, so it is often advised to use a secondary method of discovering file changes in case they get missed by the FileSystemWatcher. Your mileage may vary on this issue.
Creating a thread per message will most likely be too expensive. If you can use .NET 4, you could start a Task for each message. That would run the code on a thread pool thread and thus reduce the overhead of creating threads.
You could also do something similar with asynchronous delegates if .NET 4 is not an option. However, the code gets a bit more complicated in that case. That would utilize the thread pool as well and save you the overhead of creating a new thread for each message.
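The .NET 4 Task-per-message idea might be sketched as below. The string array standing in for discovered files and the f.Length placeholder work are assumptions for illustration:

```csharp
// Sketch: queue each discovered file as a Task (.NET 4's
// Task.Factory.StartNew) instead of creating a dedicated thread per
// message; the thread pool reuses its threads across work items.
using System;
using System.Linq;
using System.Threading.Tasks;

public static class TaskPerMessageSketch
{
    public static int ProcessAll(string[] files)
    {
        var tasks = files
            .Select(f => Task.Factory.StartNew(() => f.Length)) // placeholder work
            .ToArray();
        Task.WaitAll(tasks); // in a service you'd more likely fire-and-track
        return tasks.Sum(t => t.Result);
    }

    public static void Main()
    {
        Console.WriteLine(ProcessAll(new[] { "ab", "cde" })); // prints 5
    }
}
```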
I want to create a high-performance server in C# which could handle about ~10k clients. I started writing a TcpServer with C#, and for each client connection I open a new thread. I also use one thread to accept the connections. So far so good, it works fine.
The server has to deserialize incoming AMF objects, do some logic (like saving the position of a player), and send some objects back (serializing objects). I am not worried about the serializing/deserializing part at the moment.
My main concern is that I will have a lot of threads with 10k clients, and I've read somewhere that an OS can only hold a few hundred threads.
Are there any sources/articles available on writing a decent async threaded server? Are there other possibilities, or will 10k threads work fine? I've looked on Google, but I couldn't find much info about design patterns or approaches that explain it clearly.
You're going to run into a number of problems.
You can't spin up 10,000 threads for a couple of reasons. It'll thrash the kernel scheduler. And if you're running a 32-bit process, the default stack reservation of 1MB per thread means that 10k threads will reserve about 10GB of address space. That'll fail.
You can't use a simple select system either. At its heart, select is O(N) in the number of sockets. With 10k sockets, that's bad.
You can use IO Completion Ports. This is the scenario they're designed for. To my knowledge there is no stable, managed IO Completion port library. You'll have to write your own using P/Invoke or Managed C++. Have fun.
The way to write an efficient multithreaded server is to use I/O completion ports (using a thread per request is quite inefficient, as #Marcelo mentions).
If you use the asynchronous version of the .NET socket class, you get this for free. See this question which has pointers to documentation.
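An await-based accept loop over the async socket APIs might look like the sketch below: one logical loop per client, no dedicated thread per connection, with the OS completing IO via completion ports under the hood on Windows. The echo body is a placeholder for the AMF deserialize/logic/serialize work:

```csharp
// Sketch: async TCP server. AcceptTcpClientAsync and Read/WriteAsync keep
// no thread blocked while IO is pending, so 10k idle connections cost
// almost nothing in threads.
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class AsyncEchoServer
{
    public static async Task RunAsync(TcpListener listener)
    {
        while (true)
        {
            var client = await listener.AcceptTcpClientAsync();
            _ = HandleClientAsync(client); // per-client loop; don't block accepts
        }
    }

    static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        {
            var stream = client.GetStream();
            var buffer = new byte[4096];
            int read;
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                // deserialize AMF, run game logic, serialize the reply...
                await stream.WriteAsync(buffer, 0, read); // echo as a stand-in
            }
        }
    }
}
```

In a real server you would also catch per-client exceptions inside HandleClientAsync so one misbehaving connection can't take down the accept loop.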
You want to look into using IO completion ports. You basically have a threadpool and a queue of IO operations.
I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for requests whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports in conjunction with a pre-allocated thread pool than by creating threads at the time they receive an I/O request.
You definitely don't want a thread per request. Even if you have fewer clients, the overhead of creating and destroying threads will cripple the server, and there's no way you'll get to 10,000 threads; the OS scheduler will die a horrible death long before then.
There are numerous articles online about asynchronous server programming in C# (e.g., here). Just google around a bit.