C# - Moving files - to queue or multi-thread

C# - Moving files - to queue or multi-thread - c#

I have an app that moves a project and its files from preview to production using a Flex front-end and a .NET web service. Currently, the process takes about 5-10 mins/per project. Aside from latency concerns, it really shouldn't take that long. I'm wondering whether or not this is a good use-case for multi-threading. Also, considering the user may want to push multiple projects or one right after another, is there a way to queue the jobs.
Any suggestions and examples are greatly appreciated.
Thanks!

Something that does heavy disk IO typically isn't a good candidate for multithreading since the disks can really only do one thing at a time. However, if you're pushing to multiple servers or the servers have particularly good disk subsystems some light threading may be beneficial.

As a note - regardless of whether or not you decide to queue the jobs, you will use multi-threading. Queueing is just one way of handling what is ultimately solved using multi-threading.
And yes, I'd recommend you build a queue to push out each project.

You should compare the speed of your code compared to just copying in Windows (i.e., explorer or command line) vs copying with something advanced like TeraCopy. If your code is significantly slower than Window then look at parts in your code to optimize using a profiler. If your code is about as fast as Windows but slower than TeraCopy, then multithreading could help.
Multithreading is not generally helpful when the operation I/O bound, but copying files involves reading from the disk AND writing over the network. This is two I/O operations, so if you separate them onto different threads, it could increase performance. For something like this you need a producer/consumer setup where you have a Circular queue with one thread reading from disk and writing to the queue, and another thread reading from the queue and writing to the network. It'll be important to keep in mind that the two threads will not run at the same speed, so if the queue gets full, wait before writing more data and if it's empty, wait before writing. Also the locking strategy could have a big impact on performance here and could cause the performance to degrade to slower than a single-threaded implementation.

If you're moving things between just two computers, the network is going to be the bottleneck, so you may want to queue these operations.
Likewise, on the same machine, the I/O is going to be the bottleneck, so you'd want to queue there, too.

You should try using the ThreadPool.
ThreadPool.QueueUserWorkItem(MoveProject, project);

Agreed with everyone over the limited performance of running the tasks in parallel.
If you have full control over your deployment environment, you could use Rhino Queues:
http://ayende.com/Blog/archive/2008/08/01/Rhino-Queues.aspx
This will allow you to produce a queue of jobs asynchronously (say from a WCF service being called from your Silverlight/Flex app) and consume them synchronously.
Alternatively you could use WCF and MSMQ, but the learning curve is greater.

When dealing with multiple files using multiple threads usually IS a good idea in concerns of performance.The main reason is that most disks nowadays support native command queuing.
I wrote an article recently about reading/writing files with multiple files on ddj.com.
See http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=220300055.
Also see related question
Will using multiple threads with a RandomAccessFile help performance?
In particular i made the experience that when dealing with very many files it IS a good idea to use a number of threads. In contrary using many thread in many cases does not slow down applications as much as commonly expected.
Having said that i'd say there is no other way to find out than trying all possible different approaches. It depends on very many conditions: Hardware, OS, Drivers etc.

The very first thing you should do is point any kind of profiling tool towards your software. If you can't do that (like, if you haven't got such a tool), insert logging code.
The very first thing you need to do is figure out what is taking a long time to complete, and then why is it taking a long time to complete. That your "copy" operation as a whole takes a long time to complete isn't good enough, you need to pinpoint the reason for this down to a method or a set of methods.
Until you do that, all the other things you can do to your code will likely be guesswork. My experience has taught me that when it comes to performance, 9 out of 10 reasons for things running slow comes as surprises to the guy(s) that wrote the code.
So measure first, then change.
For instance, you might discover that you're in fact reporting progress of copying the file on a byte-per-byte basis, to a GUI, using a synchronous call to the UI, in which case it wouldn't matter how fast the actual copying can run, you'll still be bound by message handling speed.
But that's just conjecture until you know, so measure first, then change.

Related

Why do .NET threads have inferior performance to separate .NET processes?

Lately I've been observing an interesting phenomenon, and before I reengineer my whole software architecture based on it, I'd like to know why this happens, and if it's perhaps possible to make thread performance on par with process performance.
Generally, the task is to download certain data. If we make one process with 6 threads, based on the Parallel library, the downloads take around 10s.
If we, however, make 6 processes, each being single threaded, and download the same data, the whole thing will only take around 6s.
The numbers are thoroughly verified and statistically significant, so do take them for granted.
The observation holds over a large (100s of trials) dataset and I've observed no deviation from this behavior.
Basically, the question is, why a non-synchronizing multithreaded process is slower than a few separate processes with the exact same working code, and how it can be fixed?
Thanks in advance!
Note: I've read similar questions but the answers haven't been satisfactory and practical.

My guess is the same as svick's: you probably have some kind of bottleneck inserted by the runtime.
In general, you can use a tool like Fiddler or Wireshark to see how the 10 downloads are interleaving. In your case, I would expect that there would only be two active at any one time and that once one finishes, another will start immediately.
Before you go and change the setting, you should understand why it's there. It is written into the HTTP spec as suggested client behavior so as to not overwhelm the server. If your code is going to be distributed out to hundreds/thousands/millions of machines, you should consider the effects of 10 simultaneous downloads per client.

What to make parallel? What will make me better? (.net Web Business Application, MVC+SL)

I'm working on a web application framework, which uses MSSQL for data storage, mostly just does CRUD operations (but on arbitrarly complex structures), provides a WCF interface for rich Silverlight admin and has an MVC3 display (and some basic forms like user settings, etc).
It's getting quite good at being able to load, display, edit and save any (reasonably) complex data structure, in a user-friendly way.
But, I'm looking towards the future, and want to expand my capabilities (and it would be fun to learn new things along the way as well...) - so I've decided (in the light of what's coming for C#5...) to try to get some parallel/async optimalization... Now, I haven't even learned TPL and PLinq yet, so I'm happy for any advice there as well.
So my question is, what are possible areas where parallel processing maybe of help, and where does TPL and PLinq help me on that?
My guts tell me, I could try saving branches of a data structure in a parallel way to the database (this is where I'd expect the biggest peformance optimalization), I could perform some complex operations (file upload, mail sending maybe?) in a multithreaded enviroment, etc. Can I build complex SL UI views in parallel on the client? (Creating 60 data-bound fields on a view can cause "blinking"...) Can I create partial views (menus, category trees, search forms, etc) in MVC at once?
ps: If this turns into "Tell me everything about parallel stuffs" thread, I'm happy to make it community-wiki...

Remember that an asp.net web application is intrinsically a parallel application in any case. Requests can be serviced in parallel and this will all be managed by the asp.net framework. So there are two cases:
You have lots of users all hitting the site at once. In which case the parallel processing capability of the server is probably being used to capacity in any case.
You don't have lots of users all hitting the site at once. In which case the server is probably quite capable of dealing with the responses without parallel processing in a suitable fast response time.
Any time you start thinking about optimising something just because it might be fun, or because you just think you should make stuff faster then you are almost certainly guilty of premature optimization. Your efforts could almost certainly be better spent enriching the functionality of the framework, rather than making what is probably a plenty fast enough solution a little bit faster (at the cost of significantly increase complexity).
In answer to the question of where can TPL and PLINQ really help. In my opinion the main advantage of these technologies is in places in the application where you really do have a lot of long running blocking processes. For example if you have a situation where you call out several times to an external web service - it can be a significant advantage to make these calls in parallel. I would strongly question whether writing to a local database - or even a database on a different box on a local network would count as being a long running blocking process to the extent that this kind of parallelisation is of any significant value.
Pretty much all the examples you list fall in to the category of getting the PC to do something in parallel that it was previously doing in sequence. How many CPUs are on your server - how many are really free when the website is under load. Making something parallel does not necessarily equate to making it faster unless the process involved has some measure of time when you PC is sitting around doing nothing waiting for an external event.

First question is to ask the users / testers which bits seem slow. The only way to know for sure what's slowing you down is to use a profiler like dottrace. The results are sometimes surprising.
If you do find something, parallel processing may not be the answer. You need to remember that there is an overhead in splitting tasks up, so if the task is fairly quick in the first place, it could end up being slower. You also have to consider the added complexity, e.g. what happens if half a task succeeds, and half fails? (Although TPL and PLINQ hide you from this to an extend)
Have fun, but I wondering whether this is a case of 1) solution chasing a problem, and 2) premature optimization.

Most efficient way to download thousands of webpages

I have few thousand of items. For each item I need to download a webpage and process this webpage. Processing itself is not processor-intensive.
Right now, I'm doing it synchronously using webclient class, but it takes too long. I'm sure it can be easily paralelized/asynchronized. But Iam looking for most resource-efficient way to do it. There are possibly some limits for amount of active webrequests, so I dont like idea of creating thousands webclients and starting asynchronous operation on each one. Unless it is not an actual problem.
Is it possible to use Parallel Extensions and Task class in C# 4?
Edit: Thanks for the answers. I was hoping for something using asynchronous operations, because running synchronous operation in paralel will only block those thread.

You want to use a structure called a producer/consumer queue. You queue up all your urls for processing, and assign consumer threads to dequeue each url (with appropriate locking) and then download and process it.
This allows you to control and tune the number of consumers for what works best in your situation. In most cases, you'll find the optimum throughput for network operations is achieved with between 5 and 20 active connections. More and you start worrying about congestion issues on the wire or context switching issues among your threads. Of course, it varies depending on your circumstances: a server with a lot of cores and fat pipe might be able to push this number much higher, but an old P4 on dialup might find it does best with just a couple going at a time. That's why the tuning ability is so important.

Try using Parallel.ForEach([list of items], x => YourDownloadFunction(x))
It will handle concurrency automatically and efficiently, using thread pools and the whole lot.

Use Thread. Parallel.ForEach has limited threads, based on amount of cores/cpus you have. Fetching websites doesn't make a thread completely active throughout its operation. There will be delays between requests (images, static content, etc). So, use threads to maximize the speed. Start with 50 threads then go up from there to see how much your computer can handle.

C# Threading in real-world apps

Learning about threading is fascinating no doubt and there are some really good resources to do that. But, my question is threading applied explicitly either as part of design or development in real-world applications.
I have worked on some extensively used and well-architected .NET apps in C# but found no trace of explicit usage.Is there no real need due to this being managed by CLR or is there any specific reason?
Also, any example of threading coded in widely used .NET apps. in Codelplex or Gooogle Code are also welcome.

The simplest place to use threading is performing a long operation in a GUI while keeping the UI responsive.
If you perform the operation on the UI thread, the entire GUI will freeze until it finishes. (Because it won't run a message loop)
By executing it on a background thread, the UI will remain responsive.
The BackgroundWorker class is very useful here.

is threading applied explicitly either as part of design or development in real-world applications.
In order to take full advantage of modern, multi-core systems, threading must be part of the design from the start. While it's fairly easy (especially in .NET 4) to find small portions of code to thread, to get real scalability, you need to design your algorithms to handle being threaded, preferably at a "high level" in your code. The earlier this is done in the design phases, the easier it is to properly build threading into an application.
Is there no real need due to this being managed by CLR or is there any specific reason?
There is definitely a need. Threading doesn't come for free - it must be added in by the developer. The main reason this isn't found very often, especially in open source code, is really more a matter of difficulty. Even using .NET 4, properly designing algorithms to thread in a scalable, safe manner is difficult.

That entirely depends on the application.
For a client app that ever needs to do any significant work (or perform other potentially long-running tasks, such as making web service calls) I'd expect background threads to be used. This could be achieved via BackgroundWorker, explicit use of the thread pool, explicit use of Parallel Extensions, or creating new threads explicitly.
Web services and web applications are somewhat less likely to create their own threads, in my experience. You're more likely to effectively treat each request as having a separate thread (even if ASP.NET moves it around internally) and perform everything synchronously. Of course there are web applications which either execute asynchronously or start threads for other reasons - but I'd say this comes up less often than in client apps.

Definitely a +1 on the Parallel Extensions to .NET. Microsoft has done some great work here to improve the ThreadPool. You used to have one global queue which handled all tasks, even if they were spawned from a worker thread. Now they have a lock-free global queue and local queues for each worker thread. That's a very nice improvement.
I'm not as big a fan of things like Parallel.For, Parallel.Foreach, and Parallel.Invoke (regions), as I believe they should be pure language extensions rather than class libraries. Obviously, I understand why we have this intermediate step, but it's inevitable for C# to gain language improvements for concurrency and it's equally inevitable that we'll have to go back and change our code to take advantage of it :-)
Overall, if you're looking at building concurrent apps in .NET, you owe it to yourself to research the heck out of the Parallel Extensions. I also think, given that this is a pretty nascent effort from Microsoft, you should be very vocal about what works for you and what doesn't, independent of what you perceive your own skill level to be with concurrency. Microsoft is definitely listening, but I don't think there are that many people yet using the Parallel Extensions. I was at VSLive Redmond yesterday and watched a session on this topic and continue to be impressed with the team working on this.
Disclosure: I used to be the Marketing Director for Visual Studio and am now at a startup called Corensic where we're building tools to detect bugs in concurrent apps.

Most real-world usages of threading I've seen is to simply avoid blocking - UI, network, database calls, etc.
You might see it in use as BeginXXX and EndXXX method pairs, delegate.BeginInvoke calls, Control.Invoke calls.
Some systems I've seen, where threading would be a boon, actually use the isolation principle to achieve multiple "threads", in other words, split the work down into completely unrelated chunks and process them all independently of each other - "multi-threading" (or many-core utilisation) is automagically achieved by simply running all the processes at once.
I think it's fair to say you find a lot of stock-and-trade applications (data presentation) largely do not require massive parallisation, nor are they always able to be architected to be suitable for it. The examples I've seen are all very specific problems. This may attribute to why you've not seen any noticable implementations of it.

The question of whether to make use of an explicit threading implementation is normally a design consideration as others have mentioned here. Trying to implement concurrency as an afterthought usually requires a lot of radical and wholesale changes.
Keep in mind that simply throwing threads into an application doesn't inherently increase performance or speed, given that there is a cost in managing each thread, and also perhaps some memory overhead (not to mention, debugging it can be fun).
From my experience, the most common place to implement a threading design has been in Windows Services (background applications) and on applications which have had use case scenarios where a volume of work could be easily split up into smaller parcels of work (and handed off to threads to complete asynchronously).
As for examples, you could check out the Microsoft Robotics Studio (as far as I know there's a free version now) - it comes with an redistributable (I can't find it as a standalone download) of the Concurrency and Coordination Runtime, there's some coverage of it on Microsoft's Channel 9.
As mentioned by others the Parallel Extensions team (blog is here) have done some great work with thread safety and parallel execution and you can find some samples/examples on the MSDN Code site.

Threading is used in all sorts of scenarios, anything network based depends on threading, whether explicit (sockets stuff) or implicit (web services). Threading keeps UI responsive. And windows services having multiple parallel runs doing the same things in processing data working through queues that need to be processed.
Those are just the most common ones I've seen.

Most answers reference long-running tasks in a GUI application. Another very common usage scenario in my experience is Producer/Consumer queues. We have many utility applications that have to perform web requests etc. often to large number of endpoints. We use producer/consumer threading pattern (usually by integrating a custom thread pool) to allow high parallelization of these tasks.
In fact, at this very moment I am checking up on an application that uploads a 200MB file to 200 different FTP locations. We use SmartThreadPool and run up to around 50 uploads in parallel, which allows the whole batch to complete in under one hour (as opposed to over 50 hours were it all uploads to happen consecutively - so in our usage we find almost straight linear improvements in time).

As modern day programmers we love abstractions so we use threads by calling Async methods or BeginInvoke and by using things like BackgroundWorker or PFX in .Net 4.
Yet sometimes there is a need to do the threading yourself. For Example in a web app I built I have a mail queue that I add to from within the app and there is a background thread that sends the emails. If the thread notices that the queue is filling up faster that it is sending it creates another thread if it then sees that that thread is idle it kills it. This can be done with a higher level abstraction I guess but i did it manually.

I can't resist the edge case - in some applications where either a high degree of operational certainty must be achieved or a high degree of operational uncertainty must be tolerated, then threads and processes are considered from initial architecture design all the way through end delivery
Case 1 - for systems that must achieve extremely high levels of operational reliability, three completely separate subsystems using three different mechanisms may be used in a voting architecture - Spawn 3 threads/proceses across each of the voters, wait for them to conclude/die/be killed, and proceed IFF they all say the same thing - example - complex avionic susystems
Case 2 - for systems that must deal with a high degree of operational uncertainty - do the same thing, but once something/anything gets back to you, kill off the stragglers and go forth with the best answer you got - example - complex intraday trading algorithms endeavoring to destroy the business that employ them :-)

What is the easiest way to parallelize my C# program across multiple PCs

I have many unused computers at home. What would be the easiest way for me to utilize them to parallelize my C# program with little or no code changes?
The task I'm trying to do involves looping through lots of english sentences, the dataset can be easily broken into smaller chunks, processed in different machines concurrently.

… with little or no code changes?
Difficult. Basically, look into WCF as a way to communicate between various instances of the program across the network. Depending on the algorithm, the structure might have to be changed drastically, or not at all. In any case, you have to find a way to separete the problem into parts that act independently from each other. Then you have to devise a way of distributing these parts between different instances, and collecting the resulting data.
PLinq offers a great way to parallelize your program without big changes but this only works on one process, across different threads, and then only if the algorithm lends itself to parallelization. In general, some manual refactoring is necessary.

That's probably not possible.
How to parallelize a program depends entirely on what your program does and how it is written, and usually requires extensive code changes and increases the complexity of your program many fold.
The usual way to easily increase concurency in a program is take a task that is repeated many times and just write a function that splits that task into chunks and sends them to different cores to process.

The answer depends on the nature of the work your application will be doing. Different types of work have different possible parallelization solutions. For some types there is no possible/feasible way to parallelize.
The easiest scenario I can think of is for an application which work can easily be broken in discrete job chunks. If this is the case, then you simply design your application to work on a single job chunk. Provide your application with the ability to accept new jobs and deliver the finished jobs. Then, build a job scheduler on top of it. This scheduler can be part of the same application (configure one machine to be the scheduler and the rest as clients), or a separate application.
There are other things to consider: How will occur the communication among machines (files?, network connections?); the application need to be able to report/be_queried about percent of job completed?; there is a need to be able to force the application to stop proccessing current job?; etc.).
If you need a more detailed answer, edit your question and include details about the appplication, the problem the application solves, the expected amount of jobs, etc. Then, the community will come with more specific answers.

Dryad (Microsoft's variation of MapReduce) addresses exactly this problem (parallelize .net programs across multiple PCs). It's in research stage right now.
Too bad there are no CTPs yet :-(

You need to run your application on a distributed system, google for distributed computation windows or for grid computing c#.

Is each sentence processed independently, or are they somehow combined? If your processing operates on a single sentence at a time, you don't need to change your code at all. Just execute the same code on each of your machines and divide the data (your list of sentences) between them. You can do this either by installing a portion of the data on each machine, or by sharing the database and assigning a different chunk to each machine.
If you want to change your code slightly to facilitate parallelism, share the entire database and have the code "mark" each sentence as it's processed, then look for the next unmarked sentence to process. This will give you a gentle introduction to the concept of thread safety -- techniques that ensure one processor doesn't adversely interfere with another.
As always, the more details you can provide about your specific application, the better the SO community can tailor our answers to your purpose.
Good luck -- this sounds like an interesting project!

Before I would invest in parallelizing your program, why not just try breaking the datasets down into pieces and manually run your program on each computer and collate the outputs by hand. If that works, then try automating it with scripts and write a program to collate the outputs.

There are several software solutions that allow you to use commodity based hardware. One is Appistry. I work at Appistry and we have done numerous solutions to run C# applications across hundreds of machines.
A few useful links:
http://www.appistry.com/resource-library/index.html
You can download the product for free here:
http://www.appistry.com/developers/
Hope this helps
-Brett

You might want to look at Flow-Based Programming - it has a Java and a C# implementation. Most approaches to this problem involve trying to take a conventional single-threaded program and figure out which parts can run in parallel. FBP takes a different approach: the application is designed from the start in terms of multiple "black-box" components running asynchronously (think of a manufacturing assembly line). Since a conventional single-threaded program acts like a single component in the FBP environment, it is very easy to extend an existing application. In fact, pieces of an existing app can often be broken off and turned into separate components, provided they can run asynchronously with the rest of the app (i.e. not subroutines). Someone called this "turning an iceberg into ice cubes").

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.