I am using WebClient.DownloadFileTaskAsync() to download files from a website concurrently. For each file, the first x number of bytes downloads at my connection's maximum speed but then slows to a painful 32kbps, causing anything larger than a couple megabytes to take forever to complete. It makes no difference if I'm downloading 1 file or 50.
Is there any way to get around this and have the whole file download without slowing down?
using (WebClient wc = new WebClient())
{
    wc.Proxy = null;
    wc.DownloadProgressChanged += (s, e) =>
        Application.Current.Dispatcher.BeginInvoke((Action)(() =>
        {
            track.Progress = e.ProgressPercentage;
            track.TotalBytes = e.TotalBytesToReceive;
            track.BytesReceived = e.BytesReceived;
        }));
    await wc.DownloadFileTaskAsync(
        new Uri(track.FilePath),
        string.Format(
            "{0}/{1} {2}.mp3",
            directory,
            track.Number,
            track.TitlePath));
}
Update: The files exhibit the same behavior when loaded into a browser so it would seem that this problem isn't local to my application. If anybody has any ideas as to what might be causing this, please let me know.
Update: This seems to be an issue with the website I'm using. Downloads from other websites go at full speed and I tried running the program while connected to a VPN with the same results. All of WebClient's data-grabbing methods behave the same, including OpenRead. Are there any tricks that I could try that might prevent the speed from dropping?
As a point of interest, events on asynchronous operations generally marshal back automatically to the thread on which the operation was invoked. So if you invoked DownloadFileTaskAsync on the main GUI thread (or the dispatcher thread, depending on your point of view), DownloadProgressChanged will be raised on the main GUI thread and you don't need to use BeginInvoke.
I don't think that's your problem, but one thing that WebClient does do is use thread pool threads to help invoke the events. Unfortunately, this can often cause stress on the thread pool which ends up with throttling issues with the download.
If progress reporting during the download isn't completely vital, I'd suggest using HttpClient instead (it has no built-in progress events).
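For what it's worth, here is a minimal sketch of an HttpClient-based download. The class and method names (HttpDownloader, DownloadAsync, CopyWithProgressAsync) are made up for illustration, and the copy loop is only there to show one way of getting progress numbers back by hand. It will not help if the server itself is throttling the transfer, which is what the updates in the question point to.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class HttpDownloader
{
    // One shared HttpClient for the whole application.
    private static readonly HttpClient Client = new HttpClient();

    // Streams the response body to disk instead of buffering it in memory.
    public static async Task<long> DownloadAsync(Uri source, string destinationPath, IProgress<long> progress = null)
    {
        using (HttpResponseMessage response =
            await Client.GetAsync(source, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            using (Stream body = await response.Content.ReadAsStreamAsync())
            using (var file = new FileStream(destinationPath, FileMode.Create, FileAccess.Write,
                                             FileShare.None, 81920, useAsync: true))
            {
                return await CopyWithProgressAsync(body, file, progress);
            }
        }
    }

    // Hand-rolled copy loop that reports the running byte count after each chunk.
    public static async Task<long> CopyWithProgressAsync(Stream source, Stream destination, IProgress<long> progress = null)
    {
        var buffer = new byte[81920];
        long total = 0;
        int read;
        while ((read = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            await destination.WriteAsync(buffer, 0, read);
            total += read;
            if (progress != null) progress.Report(total);
        }
        return total;
    }
}
```

Passing a Progress<long> created on the UI thread should deliver the reports back on that thread, so no explicit BeginInvoke is needed in the handler.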
I have an MT app that downloads content from the internet (e.g. lots of images, 10K to 5MB each). One download session can represent gigabytes of data. I have wrapped the download in a Parallel.ForEach loop and that works, but it doesn't seem to use any more than one thread on the device for downloading (I would like at least two to reduce the download time).
Note: Parallel.ForEach does create multiple threads in the simulator. Should I just throw all the downloads as tasks into the thread pool? Should I spin up my own queue and threads and bypass the threadpool? I know the threadpool scales to match the device, so that might not be the best option.
When it comes to IO, only the application developer knows how much parallelism he wants. Don't rely on the TPL for that - it knows nothing about IO.
Create the right amount of IO parallelism yourself by starting the correct number of tasks manually, using PLINQ with an exact degree of parallelism or using async IO (which is thread-less).
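As a rough sketch of the async IO option: a SemaphoreSlim can cap how many operations are in flight at once, no matter how many threads the pool has. The names here (BoundedIo, RunAsync, maxParallel) are hypothetical, and the delegate stands in for whatever download call you actually use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedIo
{
    // Runs work(item) for every item, with at most maxParallel operations in flight.
    public static async Task RunAsync<T>(IEnumerable<T> items, int maxParallel, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            List<Task> tasks = items.Select(async item =>
            {
                await gate.WaitAsync();          // wait for a free slot without blocking a thread
                try { await work(item); }
                finally { gate.Release(); }      // free the slot even if the download failed
            }).ToList();

            await Task.WhenAll(tasks);
        }
    }
}
```

Usage would be something like `await BoundedIo.RunAsync(urls, 2, url => DownloadOneAsync(url));`, where DownloadOneAsync is your own method.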
Are you downloading via HTTP? I've found the WebClient class to work well for the type of thing you're describing.
Something like:
WebClient client = new WebClient();
client.DownloadFileCompleted += new AsyncCompletedEventHandler(client_DownloadFileCompleted);
// DownloadFileAsync takes a Uri, not a string.
client.DownloadFileAsync(new Uri("http://stackoverflow.com"), "test.txt");

void client_DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
    // File finished downloading; check e.Error and e.Cancelled before using it.
}
This way there's no need to manage the threads yourself.
Also, if you want to read the data right away, you might just want to use DownloadDataAsync and save the file yourself.
I want to implement a Windows service that captures flat delimited files dropped into a folder and imports them into the database. What I originally envisioned is a FileSystemWatcher watching for new files and creating a new thread for each import.
I want to know how I should properly implement this and what technique I should use. Am I going in the right direction?
I developed a product like this for a customer. The service monitored a number of folders for new files, and when files were discovered they were read, processed (printed on barcode printers), archived and deleted.
We used a "discoverer" layer that discovered files using FileSystemWatcher or polling depending on the environment (since FileSystemWatcher is not reliable when monitoring e.g. Samba shares), a "file reader" layer and a "processor" layer.
The "discoverer" layer discovered files and put the filenames in a list that the "file reader" layer processed. The "discoverer" layer signaled that there were new files to process by setting an event that the "file reader" layer was waiting on.
The "file reader" layer then read the files (using retry functionality, since you may get notifications for new files before they have been completely written by the process that creates them).
After the "file reader" layer had read the file, a new "processor" work item was queued using ThreadPool.QueueUserWorkItem to process the file contents.
When the file had been processed, the original file was copied to an archive and deleted from the original location. The archive was also cleaned up regularly to keep from flooding the server. The archive was great for troubleshooting.
This has now been used in production in a number of different environments for over two years and has proved to be very reliable.
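The retry step in the "file reader" layer could look roughly like the sketch below. This is not the code from that product. The names and the default attempt count and delay are assumptions picked for illustration.

```csharp
using System;
using System.IO;
using System.Threading;

public static class FileReader
{
    // Tries to read the whole file, retrying while another process still holds it open.
    public static string ReadAllTextWithRetry(string path, int maxAttempts = 5, int delayMs = 500)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                // On Windows, FileShare.None makes the open fail with an IOException
                // while the writing process still has the file open.
                using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
                using (var reader = new StreamReader(stream))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                // Not ready yet: back off and retry. The last attempt lets the exception propagate.
                Thread.Sleep(delayMs);
            }
        }
    }
}
```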
I've fielded a service that does this as well. I poll via a timer whose elapsed event handler acts as a supervisor, adding new files to a queue and launching a configurable number of threads that consume the queue. Once the files are processed, it restarts the timer.
Each thread including the event handler traps and reports all exceptions. The service is always running, and I use a separate UI app to tell the service to start and stop the timer. This approach has been rock solid and the service has never crashed in several years of processing.
The traditional approach is to create a finite set of threads (could be as few as 1) and have them watch a blocking queue. The code in the FileSystemWatcher [1] event handlers will enqueue work items while the worker thread(s) dequeue and process them. It might look like the following, which uses the BlockingCollection class, available in .NET 4.0 or as part of the Reactive Extensions download.
Note: The code is kept short for brevity. You will have to expand and harden it yourself.
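using System.Collections.Concurrent;
using System.IO;
using System.Threading;

public class Example
{
    private BlockingCollection<string> m_Queue = new BlockingCollection<string>();

    public Example()
    {
        // A single background consumer thread; it ends when the process exits.
        var thread = new Thread(Process);
        thread.IsBackground = true;
        thread.Start();
    }

    // Wire this to FileSystemWatcher.Created (and Changed, as needed).
    private void FileSystemWatcher_Event(object sender, FileSystemEventArgs args)
    {
        m_Queue.Add(args.FullPath);
    }

    private void Process()
    {
        while (true)
        {
            // Take() blocks until an item is available.
            string file = m_Queue.Take();
            // Process the file here.
        }
    }
}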
You could take advantage of the Task class in the TPL for a more modern and ThreadPool-like approach. You would start a new task for each file (or perhaps batch them) that needs to be processed. The only gotcha I see with this approach is that it would be harder to control the number of database connections being opened simultaneously. It's definitely not a showstopper and it might be of no concern.
[1] The FileSystemWatcher has been known to be a little flaky, so it is often advised to use a secondary method of discovering file changes in case they get missed by the FileSystemWatcher. Your mileage may vary on this issue.
Creating a thread per message will most likely be too expensive. If you can use .NET 4, you could start a Task for each message. That would run the code on a thread pool thread and thus reduce the overhead of creating threads.
You could also do something similar with asynchronous delegates if .NET 4 is not an option. However, the code gets a bit more complicated in that case. That would utilize the thread pool as well and save you the overhead of creating a new thread for each message.
So I've been told what I'm doing here is wrong, but I'm not sure why.
I have a webpage that imports a CSV file with document numbers to perform an expensive operation on. I've put the expensive operation into a background thread to prevent it from blocking the application. Here's what I have in a nutshell.
protected void ButtonUpload_Click(object sender, EventArgs e)
{
    if (FileUploadCSV.HasFile)
    {
        string fileText;
        using (var sr = new StreamReader(FileUploadCSV.FileContent))
        {
            fileText = sr.ReadToEnd();
        }
        var documentNumbers = fileText.Split(new[] {',', '\n', '\r'}, StringSplitOptions.RemoveEmptyEntries);
        ThreadStart threadStart = () => AnotherClass.ExpensiveOperation(documentNumbers);
        var thread = new Thread(threadStart) {IsBackground = true};
        thread.Start();
    }
}
(obviously with some error checking & messages for users thrown in)
So my three-fold question is:
a) Is this a bad idea?
b) Why is this a bad idea?
c) What would you do instead?
A possible problem is that your background thread is running in your website's application pool. IIS may decide to recycle your application pool, causing the expensive operation to be killed before it is done.
I would rather go for an option where I had a separate process, possibly a windows service, that would get the expensive operation requests and perform them outside the asp.net process. Not only would this mean that your expensive operation would survive an application pool restart, but it would also simplify your web application since it didn't have to handle the processing.
Telling the service to perform the expensive process could be done using some sort of inter-process communication: the service could poll a database table or a file, or you could use a message queue that the service would listen to.
There are many ways to do this, but my main point is that you should separate the expensive process from your web application if possible.
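To make the "poll a file" variant concrete, here is a minimal sketch of a spool directory shared by the web application and the service. Everything in it (the JobSpool class, the .tmp/.job naming, the spool directory itself) is an assumption for illustration. A database table or a real message queue would follow the same shape.

```csharp
using System;
using System.IO;

public static class JobSpool
{
    // Web application side: persist the request and return immediately.
    public static string Enqueue(string spoolDir, string[] documentNumbers)
    {
        Directory.CreateDirectory(spoolDir);
        string name = Guid.NewGuid().ToString("N");
        string temp = Path.Combine(spoolDir, name + ".tmp");
        string job = Path.Combine(spoolDir, name + ".job");

        File.WriteAllLines(temp, documentNumbers);
        File.Move(temp, job);   // rename last, so the service never picks up a half-written job
        return job;
    }

    // Service side: call this from a timer. Runs each pending job, then removes it.
    public static int ProcessPending(string spoolDir, Action<string[]> expensiveOperation)
    {
        Directory.CreateDirectory(spoolDir);
        int processed = 0;
        foreach (string job in Directory.GetFiles(spoolDir, "*.job"))
        {
            expensiveOperation(File.ReadAllLines(job));
            File.Delete(job);
            processed++;
        }
        return processed;
    }
}
```

The upload handler would then just call `JobSpool.Enqueue(...)` with the parsed document numbers, and an application pool recycle no longer loses work that has already been accepted.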
I recommend you use the BackgroundWorker class instead of using threads directly. This is because BackgroundWorker is designed specifically to perform background operations for a graphical application, and (among other things) provides mechanisms to communicate updates to the user interface.
a: yes.
Use the ThreadPool;) Queue a WorkItem - avoids the overhead of generating tons of threads.
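A minimal sketch of the ThreadPool version, with hypothetical names (BackgroundJobs, Queue, onError). Note that the work still runs inside the ASP.NET process, so the application pool recycling concern raised in the other answer applies here too.

```csharp
using System;
using System.Threading;

public static class BackgroundJobs
{
    // Queues the expensive operation on a thread pool thread and returns immediately.
    public static void Queue(string[] documentNumbers,
                             Action<string[]> expensiveOperation,
                             Action<Exception> onError)
    {
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try
            {
                expensiveOperation(documentNumbers);
            }
            catch (Exception ex)
            {
                // An unhandled exception on a pool thread would take down the process,
                // so hand it to the caller's error handler (for logging, say).
                onError(ex);
            }
        });
    }
}
```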
Here's the setup: I'm trying to make a relatively simple Winforms app, a feed reader using the FeedDotNet library. The question I have is about using the threadpool. Since FeedDotNet is making synchronous HttpWebRequests, it is blocking the GUI thread. So the best thing seemed like putting the synchronous call on a ThreadPool thread, and while it is working, invoke the controls that need updating on the form. Some rough code:
private void ThreadProc(object state)
{
    Interlocked.Increment(ref updatesPending);
    // check that main form isn't closed/closing so that we don't get an ObjectDisposedException exception
    if (this.IsDisposed || !this.IsHandleCreated) return;
    if (this.InvokeRequired)
        this.Invoke((MethodInvoker)delegate
        {
            if (!marqueeProgressBar.Visible)
                this.marqueeProgressBar.Visible = true;
        });
    ThreadAction t = state as ThreadAction;
    Feed feed = FeedReader.Read(t.XmlUri);
    Interlocked.Decrement(ref updatesPending);
    if (this.IsDisposed || !this.IsHandleCreated) return;
    if (this.InvokeRequired)
        this.Invoke((MethodInvoker)delegate { ProcessFeedResult(feed, t.Action, t.Node); });
    // finished everything, hide progress bar
    if (updatesPending == 0)
    {
        if (this.IsDisposed || !this.IsHandleCreated) return;
        if (this.InvokeRequired)
            this.Invoke((MethodInvoker)delegate { this.marqueeProgressBar.Visible = false; });
    }
}
this = main form instance
updatesPending = volatile int in the main form
ProcessFeedResult = method that does some operations on the Feed object. Since a threadpool thread can't return a result, is this an acceptable way of processing the result via the main thread?
The main thing I'm worried about is how this scales. I've tried ~250 requests at once. The max number of threads I've seen was around 53 and once all threads were completed, back to 21. I recall in one exceptional instance of me playing around with the code, I had seen it rise as high as 120. This isn't normal, is it? Also, being on Windows XP, I reckon that with such high number of connections, there would be a bottleneck somewhere. Am I right?
What can I do to ensure maximum efficiency of threads/connections?
Having all these questions also made me wonder whether this is the right case for ThreadPool use. MSDN and other sources say it should be used for "short-lived" tasks. Is 1-2 seconds "short-lived" enough, considering I'm on a relatively fast connection? What if the user is on a 56K dial-up and one request could take 5-12 seconds or even more? Would the ThreadPool be an efficient solution then too?
The ThreadPool, unchecked, is probably a bad idea.
Out of the box you get 250 threads in the thread pool per CPU (as of .NET 2.0 SP1 and 3.5; later versions size the pool differently).
Imagine if in a single burst you flatten out someone's net connection and get them banned from getting notifications from a site because they are suspected of running a DoS attack.
Instead, when downloading stuff from the net you should build in tons of control. The user should be able to decide how many concurrent requests they make (and how many concurrent requests per domain), ideally you also want to offer controls for the amount of bandwidth.
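One way the per-domain limit could be sketched is a small throttle keyed by host name. The PerHostThrottle class and its members are hypothetical, and the limit passed to the constructor is whatever the user configured.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class PerHostThrottle
{
    private readonly int _maxPerHost;
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates =
        new ConcurrentDictionary<string, SemaphoreSlim>(StringComparer.OrdinalIgnoreCase);

    public PerHostThrottle(int maxPerHost)
    {
        _maxPerHost = maxPerHost;
    }

    // Runs the request once a slot for uri.Host is free; other hosts are not held up.
    public async Task<T> RunAsync<T>(Uri uri, Func<Task<T>> request)
    {
        SemaphoreSlim gate = _gates.GetOrAdd(uri.Host, _ => new SemaphoreSlim(_maxPerHost));
        await gate.WaitAsync();
        try
        {
            return await request();
        }
        finally
        {
            gate.Release();
        }
    }
}
```

On the .NET Framework, ServicePointManager.DefaultConnectionLimit also caps the number of simultaneous connections per host for HttpWebRequest, so it is worth checking that setting as well.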
Though this could be orchestrated with the ThreadPool, having dedicated threads or using something like a bunch of instances of the BackgroundWorker class is a better option.
My understanding of the ThreadPool is that it is designed for this type of situation. I think the definition of short-lived is of this order of time - perhaps even up to minutes. A "long-lived" thread would be one that was alive for the lifetime of the application.
Don't forget Microsoft would have spent some time getting the efficiency of the ThreadPool as high as it could. Do you think that you could write something that was more efficient? I know I couldn't.
The .NET thread pool is designed specifically for executing short-running tasks for which the overhead of creating a new thread would negate the benefits of creating a new thread. It is not designed for tasks which block for prolonged periods or have a long execution time.
The idea is for a task to hop onto a thread, run quickly, complete and hop off.
The BackgroundWorker class provides an easy way to execute tasks on a thread pool thread, and provides mechanisms for the task to report progress and handle cancel requests.
In this MSDN article on the BackgroundWorker Component, file downloads are explicitly given as examples of the appropriate use of this class. That should hopefully encourage you to use this class to perform the work you need.
If you're worried about overusing the thread pool, you can be assured the runtime does manage the number of available threads based on demand. Tasks are queued on the thread pool for execution. When a thread becomes available to do work, the task is loaded onto the thread. At regular intervals, a monitoring process checks the state of the thread pool. If there are tasks waiting to be executed, it can create more threads. If there are several idle threads, it can shut down some to release resources.
In a worst-case scenario, where all threads are busy and you have work queued up, the runtime will be adding threads to deal with the extra workload. The application will be running more slowly as it has to wait for more threads to be made available, but it will continue to run.
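A minimal sketch of the BackgroundWorker pattern for this case. FeedWorker, readFeed and onCompleted are hypothetical names; readFeed stands in for the blocking feed download. When the worker is started from the WinForms UI thread, RunWorkerCompleted should be raised back on that thread, so the callback can touch controls directly.

```csharp
using System;
using System.ComponentModel;

public static class FeedWorker
{
    // Runs readFeed on a thread pool thread and reports the result (or the error) when done.
    public static BackgroundWorker Start(Uri feedUri,
                                         Func<Uri, string> readFeed,
                                         Action<string, Exception> onCompleted)
    {
        var worker = new BackgroundWorker();

        // DoWork runs on a thread pool thread; an exception thrown here ends up in e.Error below.
        worker.DoWork += (s, e) => e.Result = readFeed((Uri)e.Argument);

        // Reading e.Result when e.Error is set would throw, so check the error first.
        worker.RunWorkerCompleted += (s, e) =>
            onCompleted(e.Error == null ? (string)e.Result : null, e.Error);

        worker.RunWorkerAsync(feedUri);
        return worker;
    }
}
```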
A few points, combining info from a few other answers:
Your ThreadProc does not contain exception handling. You should add that, or a single I/O error will halt your process.
Sam Saffron is quite right that you should limit the number of threads. You could use a (thread-safe) queue to push your feeds into (as work items) and have one or more threads reading from the queue in a loop.
The BackgroundWorker might be a good idea; it would provide you with both the exception handling and the synchronization you need.
And the BackgroundWorker uses the ThreadPool, and that is fine.
You may want to take a look to the "BackgroundWorker" class.
A few words about an ongoing design and implementation
I send a lot of requests to a remote application (running on a different host, of course), and the application sends back data.
About client
The client is a UI that spawns a separate thread to submit and process the requests. Once it has submitted all the requests, it calls Wait. Wait parses all events coming from the app and invokes the client's callbacks.
Below is the implementation of Wait.
public void Wait (uint milliseconds)
{
    while(_socket.IsConnected)
    {
        if (_socket.Poll(milliseconds, SelectMode.SelectRead))
        {
            // read info of the buffer and calls registered callbacks for the client
            if(_socket.IsAvailable > 0)
                ProcessSocket(socket);
        }
        else
            return; //returns after Poll has expired
    }
}
The Wait is called from a separate thread, responsible for managing network connection: both inbound and outbound traffic:
_Receiver = new Thread(DoWork);
_Receiver.IsBackground = true;
_Receiver.Start(this);
This thread is created from UI component of the application.
The issue:
The client sometimes sees delays in callbacks even though the main application has sent the data on time. Notably, one message in Poll was delayed until the client disconnected, and internally I called:
_socket.Shutdown(SocketShutdown.Both);
I think something funky is happening in the Poll
Any suggestions on how to fix the issue or an alternative workaround?
Thanks
please let me know if anything is unclear
A couple of things. First, in your example, is there a difference between "_socket" and "socket"? Second, you are using the System.Net.Sockets.Socket class, right? I don't see IsConnected or IsAvailable properties on that class in the MSDN documentation for any .NET version going back to 1.1. I assume these are both typing mistakes, right?
Have you tried putting an "else" clause on the "IsAvailable > 0" test and writing a message to the Console/Output window, e.g.,
if (_socket.IsAvailable > 0) {
    ProcessSocket(socket);
} else {
    Console.WriteLine("Poll() returned true but there is no data");
}
This might give you an idea of what might be going on in the larger context of your program.
Aside from that, I'm not a big fan of polling sockets for data. As an alternative, is there a reason not to use the asynchronous Begin/EndReceive functions on the socket? I think it'd be straightforward to convert to the asynchronous model given the fact that you're already using a separate thread to send and receive your data. Here is an example from MSDN. Additionally, I've added the typical implementation that I use of this mechanism to this SO post.
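For reference, a bare-bones version of the Begin/EndReceive loop might look like the sketch below. It is not the MSDN sample or the code from the linked post. The AsyncReceiver class and its two callbacks are hypothetical, and message framing is left out entirely.

```csharp
using System;
using System.Net.Sockets;

public sealed class AsyncReceiver
{
    private readonly Socket _socket;
    private readonly byte[] _buffer = new byte[8192];
    private readonly Action<byte[], int> _onData;   // called with the buffer and the byte count
    private readonly Action _onClosed;              // called once the connection is gone

    public AsyncReceiver(Socket socket, Action<byte[], int> onData, Action onClosed)
    {
        _socket = socket;
        _onData = onData;
        _onClosed = onClosed;
    }

    // Posts one receive; the callback re-posts it, so no polling thread is needed.
    public void Start()
    {
        try
        {
            _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceive, null);
        }
        catch (SocketException) { _onClosed(); }
        catch (ObjectDisposedException) { _onClosed(); }
    }

    private void OnReceive(IAsyncResult ar)
    {
        int read;
        try
        {
            read = _socket.EndReceive(ar);
        }
        catch (SocketException) { _onClosed(); return; }
        catch (ObjectDisposedException) { _onClosed(); return; }

        if (read == 0)
        {
            _onClosed();        // the remote side shut the connection down
            return;
        }

        _onData(_buffer, read); // the buffer is reused, so copy out what you need here
        Start();                // wait for the next chunk
    }
}
```

The callbacks run on a thread pool (I/O completion) thread, so anything that touches the UI still has to be marshalled back to it.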
What thread is calling the Wait() method? If you're just throwing it into the UI threadpool, that may be why you experience delays sometimes. If this is your problem, then either use the system threadpool, create a new one just for the networking parts of your application, or spawn a dedicated thread for it.
Beyond this, it's hard to help you much without seeing more code.