Download a file programmatically in multiple chunks in parallel in restartable mode

Download a file programmatically in multiple chunks in parallel in restartable mode - c#

I need to download a large file using HTTP protocol via a quite slow network connection. When doing it manually, the download speed sometimes is unbearably slow and the process sometimes freezes or terminates.
For manual downloads, the situation can be greatly improved by using a download manager (e.g. FDM) — a class of programs that was indispensable and extremely popular a decade or so ago, but whose usage quickly diminishes nowadays because of better and faster networking available — it starts multiple download sessions of the same file in parallel in chunks starting from different positions, automatically restarts failed or stale sessions, implements work balancing (after a successful download of a chunk splits some of the remaining chunks still being downloaded into two sessions), and eventually stitch all downloaded chunks into a complete single file. Overall, it allows to make file downloading robust and much faster on poor connections.
Now I am trying to implement the same download behavior in C# for automatic unattended downloads. I cannot see any of existing classes in .NET framework implementing this, so I am looking for advice how to implement it manually (possibly, with an aid of some open-source .NET libraries).

This is possible using the HttpWebRequest.AddRange method which allows you to get the bytes of a file from a specific range. So when the file exists read the number of bytes and pass it through the HttpWebRequest.AddRange. See a code sample at Codeproject:
http://www.codeproject.com/Tips/307548/Resume-Suppoert-Downloading
Additional information about passing different types of ranges see: http://msdn.microsoft.com/en-us/library/4ds43y3w.aspx

Related

How to use multithreading or any other .NET technology to scale a program performing network, disk and processor intensive jobs?

The Problem:
Download a batch of PDF files from pickup.fileserver (SFTP or windows share) to local hard drive (Polling is involved here to check if files are available to download)
Process (resize, apply barcodes etc) the PDF files, create some metadata files, update database etc
Upload this batch to dropoff.fileserver (SFTP)
Await response from dropoff.fileserver (Again polling is the only option). Once the batch response is available, download it local HD.
Parse the batch response, update database and finally upload report to pickup.fileserver
Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches so we have created a windows service which can keep polling at certain time intervals and perform the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works file, however, I'm concerned that it is NOT making best use of available resources, there is certainly a lot of room for improvement. I have very little idea about how I can scale this windows service to be able to process as many batches simultaneously as it can. And then if required, how to involve multiple instances of this windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollectiong<T> etc.) Some say that it wouldn't make sense to create multi-threaded app for IO intensive tasks. What we have here is a mixture of disk + network + processor intensive tasks. I need to understand how best to use threading or any other technology to make best use of available resources on one server and go beyond one server (if required) to scale further.
Typical Batch Size
We regularly get batches of 200~ files, 300 MB~ total size. # of batches can grow to about 50 to 100, in next year or two. A couple of times in a year, we get batches of 5k to 10k files.

As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into 6 services (one per step) that can then be tuned, multiplied or multi-threaded to provide the throughput you need.
Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.

The current solution works fine
Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?
That said...
I need to understand how best to use threading or any other technology to make best use of available resources on one server
If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
and go beyond one server (if required) to scale further.
That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.

According to this article you may implement background worker jobs (Hangfire preferably) in your application layer and reduce code and deployment management of multiple windows services and achieve the same result possibly.
Also, you won't need to bother about handling multiple windows services.
Additionally it can restore in case of failure at application level or restart events.

There is no magic technology that will solve your problem, you need to analyse each part of it step by step.
You will need to profile the application and determine what areas are slow performing and refactor the code to resolve the problem.
This might mean increasing the demand on one resource to decrease demand on another, for example: You might find that you are doing a database lookup 10 times for each file you process. But caching the data before starting processing files is quicker, but maybe only if you have a batch larger than xx files.
You might find that to increase the processing speed of the whole batch that this is maybe not the optimal method for a single file.
As your program has multiple steps then you can look at each of these in turn, and as a whole.
My guess would be that the ftp download and upload would take the most time. So, you can look at running this in parallel. Whether this means running xx threads at once each processing a file, or having a separate task/thread for each stage in your process you can only determine with testing.
A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.
Don’t forget that you must weight this up against the time and effort needed to implement this and the benefit. If the service runs overnight and takes 6 hours to run is it really a benefit if it takes 4 hours, if the people who need to work on the result will not be in the office anyway until much later.

To this kind of problem do you have the any specific file types that you download from the SFTP. I have a similar problem in downloading the large files but it is not a windows service in my case its EXE that runs on the System.timers.
Try to create the threads for each file types which are large in
size eg: PDF's.
You can check for these file types while downloading the SFTP file
path and assign them to a thread process to download.
You also need to upload the files also in vice versa.
--In my case all I was able to do was to tweak the existing one and create a separate thread process for a large file types. that solved my problem as flat files and Large PDF files are downloaded parallel threads.

how to copy big amount of small files in high performance?

In my case, I call file.copy() to copy those small files (3KB) from different directories to different directories each. I put the source paths in a list<string>.
What can I do to improve the performance of copying?
Should I use multi-thread? On just the sequential way?

I'm not sure how you could force performance into the issue as copying files is handled by Windows and you will be pretty much limited by the hardware that you have, e.g., the type of storage you are using (disk or SSD), the type of connection you have (LAN/USB 2.0 or 3.0/etc).
That being said an asynchronous file copy would probably work best for what you want to do whatever the scenario is. The best resource for that would be the Asynchronous File I/O reference on MSDN.

"C# how to copy **big amount of small files in high performance**?".
There is not much code in your question, so i am giving you a direction(help url) towards solution..
I would recommended using Zipping as an option. Advantages i feel..
Compression of the size.
Files are reduced to single zip file.
Because of above n/w copying get faster...
NOTE : For .Net 4.5 you can go ahead and use the ZipFile, prior to that you might need to use GZip or Deflate compression algorithms
MSDN url - http://msdn.microsoft.com/en-us/library/ms404280(v=vs.110).aspx

There may be some advantage to reading them into memory in large chunks (say, 1GB at a time) and then writing them out. This could allow them to be written to disk sequentially without having to move the disk heads back and forth so much.
Also consider using the more flexible CopyFileEx API as described here. It allows you to specify NoBuffering to avoid filling up your cache with files you don't care about, plus offers asynchronous modes, cancellation, and progress reporting.

I think You should copy file Asynchronously. Please find the below link that will help you to make an Asynchronous call for copying file.
Asynchronous Programming Techniques

Efficient log backup program in C#

I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files and then move them to a central data storage server. I will have to move about 270Gb of data every 24 hours. I have a dedicated server to run this job and a LAN of 1Gbps. Currently I am reading lines from a (text)file, copying them into a buffer stream and writing them to the destination.
My last test copied about 2.5Gb of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then using a stream buffer a bit to copy. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.

You first need to profile as Umair said so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as Filezilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.

One of the solutions can be is what you mantioned: compress files in one Zip file and after transfer them via network. This will bemuch faster as you are transfering one file and often on of principal bottleneck during file transfers is Destination security checks.
So if you use one zip file, there should be one check.
In short:
Compress
Transfer
Decompress (if you need)
This already have to bring you big benefits in terms of performance.

Compress the logs at source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke this) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .Net?)
In any event, do not read your files linewise - read the files in blocks (loop doing FileStream.Read for 4K - say - bytes until read count == 0) and send that direct to the network pipe.

Trying profiling your program... bottleneck is often where you least expect it to be. As some clever guy said "Premature optimisation is the root of all evil".
Once in a similar scenario at work, I was given the task to optimise the process. And after profiling the bottleneck was found to be a call to sleep function (which was used for synchronisation between thread!!!! ).

Logic in Disk Defragmantation & Disk Check

What is the logic behind disk defragmentation and Disk Check in Windows? Can I do it using C# coding?

For completeness sake, here's a C# API wrapper for defragmentation:
http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13/229137.aspx
Defragmentation with these APIs is (supposed to be) very safe nowadays. You shouldn't be able to corrupt the file system even if you wanted to.
Commercial defragmentation programs use the same APIs.

Look at Defragmenting Files at msdn for possible API helpers.
You should carefully think about using C# for this task, as it may introduce some undesired overhead for marshaling into native Win32.

If you don't know the logic for defragmentation, and if you didn't write the file system yourself so you can't authoritatively check it for errors, why not just start new processes running 'defrag' and 'chkdsk'?

Mark Russinovich wrote an article Inside Windows NT Disk Defragmentation a while ago which gives in-depth details. If you really want to do this I would really advise you to use the built-in facilities for defragmenting. More so, on recent OSes I have never seen a need as a user to even care about defragmenting; it will be done automatically on a schedule and the NTFS folks at MS are definitely smarter at that stuff than you (sorry, but they do this for some time now, you don't).

Despite its importance, the file system is no more than a data structure that maps file names into lists of disk blocks. And keeps track of meta-information such as the actual length of the file and special files that keep lists of files (e.g., directories). A disk checker verifies that the data structure is consistent. That is, every disk block must either be free for allocation to a file or belong to a single file. It can also check for certain cases where a set of disk blocks appears to be a file that should be in a directory but is not for some reason.
Defragmentation is about looking at the lists of disk blocks assigned to each file. Files will generally load faster if they use a contiguous set of blocks rather than ones scattered all over the disk. And generally the entire file system will perform best if all the disk blocks in use confine themselves to a single congtiguous range of the disk. Thus the trick is moving disk blocks around safely to achieve this end while not destroying the file system.
The major difficulty here is running these application while a disk is in use. It is possible but one has to be very, very, very careful not to make some kind of obvious or extremely subtle error and destroy most or all of the files. It is easier to work on a file system offline.
The other difficulty is dealing with the complexities of the file system. For example, you'd be much better off building something that supports FAT32 rather than NTFS because the former is a much, much simpler file system.
As long as you have low-level block access and some sensible way for dealing with concurrency problems (best handled by working on the file system when it is not in use) you can do this in C#, perl or any language you like.
BUT BE VERY CAREFUL. Early versions of the program will destroy entire file systems. Later versions will do so but only under obscure circumstances. And users get extremely angry and litigious if you destroy their data.

What's the best way to read and parse a large text file over the network?

I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100mb+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine, and parsing it there. That leads to issue 2. Since the files are quite large copying it locally can take quite a while.
To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.
The two main issues are:
1) How do i speed up the file transfer (Compression?, Is transferring locally even neccessary?, Can I read an in use file some other way?)
2) How do i deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didnt do the obvious parse on the server reason is that I want to have as little cpu impact as possible. I don't want to affect the performance of the system im testing.

If you are reading a sequential file you want to read it in line by line over the network. You need a transfer method capable of streaming. You'll need to review your IO streaming technology to figure this out.
Large IO operations like this won't benefit much by multithreading since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.

The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.

The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.

I guess it depends on how "remote" it is. 100MB on a 100Mb LAN would be about 8 secs...up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards, and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.

Use compression for transfer.
If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up, you just have to do it in a smart way -- have a deterministic algorithm for which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of a middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.

If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.

I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.

i think using compression (deflate/gzip) would help

The given answer do not satisfy me and maybe my answer will help others to not think it is super complicated or multithreading wouldn't benefit in such a scenario. Maybe it will not make the transfer faster but depending on the complexity of your parsing it may make the parsing/or analysis of the parsed data faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Are these information like statistics or are they dependent on multiple log message?
You have several options:
parse multiple files at the same would be the easiest I guess, you have the file as context and can create one thread per file
another option as mentioned before is use compression for the network communication
you could also use a helper that splits the log file into lines that belong together as a first step and then with multiple threads process these blocks of lines; the parsing of this depend lines should be quite easy and fast.
Very important in such a scenario is to measure were your actual bottleneck is. If your bottleneck is the network you wont benefit of optimizing the parser too much. If your parser creates a lot of objects of the same kind you could use the ObjectPool pattern and create objects with multiple threads. Try to process the input without allocating too much new strings. Often parsers are written by using a lot of string.Split and so forth, that is not really as fast as it could be. You could navigate the Stream by checking the coming values without reading the complete string and splitting it again but directly fill the objects you will need after parsing is done.
Optimization is almost always possible, the question is how much you get out for how much input and how critical your scenario is.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.