I'm writing an application that uploads large files to a web service using HttpWebRequest.
This application will be run by various people with various internet speeds.
I asynchronously read the file in chunks, and asynchronously write those chunks to the request stream. I do this in a loop using callbacks. And I keep doing this until the whole file has been sent.
The speed of the upload is calculated between writes and the GUI is subsequently updated to show said speed.
The issue I'm facing is deciding on a buffer size. If I make it too large, users with slow connections will not see frequent updates to the speed. If I make it too small, users with fast connections will end up "hammering" the read/write methods causing CPU usage to spike.
What I'm doing now is starting the buffer off at 128kb, and then every 10 writes I check the average write speed of those 10 writes, and if it's under a second I increase the buffer size by 128kb. I also shrink the buffer in a similar fashion if the write speed drops below 5 seconds.
This works quite well, but it all feels very arbitrary and it seems like there is room for improvement. My question is, has anybody dealt with a similar situation and what course of action did you take?
Thanks
I think this is a good approach. I too used in large file upload. But there was a small tweek in that. I determined the connection speed in the first request by placing a call to my different service. This would actually save the overhead of recalculating the speed with every request. The primary reason of doing so was
In slow connection, the speed usually fluctuate very much. Thus recalculating it every request does not make sense.
I was supposed to provide resume facility also where the user would be able to reupload the file from the point where it ended last time.
Considering the scalability, I used to fix the buffer with the first request. Let me know if it helped
Related
A little background...
Everything I'm about to describe up to my implementation of the StreamWriter is business processes which I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files x 5000 business units per file = 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
foreach(string SplitFile in List<BusinessFiles>)
{
using (StreamWriter SW = new StreamWriter(SplitFile))
{
foreach(var BL in g)
{
string[] Split1 = BL.Split(',');
SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5], Split1[6], Split1[7], Split1[8], Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
}
}
}
The issue with this is, It takes an excessive amount of time.
Like, It can take 20 mins to process all the files sometimes.
Profiling my code shows me that 98% of the time spent is on the system disposing of the StreamWriter after the program leaves the loop.
So my question is......
Is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without having to call another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First thing is, I already had the reading of the one file and writing of the massive amount of smaller files in a nested parallel loop so I was essentially maxing out my I/O as is.
I'm also writing to an SSD, so all those were good points.
Turns out I'm reading the 1 massive file and writing ~5600 smaller ones every 90 seconds or so.
That's 60 files a second. I guess I can't really ask for much more than that.
This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
I would agree with Joel that the time is mostly due to writing the data out to disk. I would however be a little bit more optimistic about doing parallel IO, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
myBusinessFiles,
new ParallelOptions(){MaxDegreeOfParallelism = 2},
SplitFile => {
// Loop body
});
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing high speed local SSD
I'm assuming you are writing to a network folder, this will add some additional latency, so you might try to write to a local disk. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards, you will likely not gain anything, but it can give an idea about the penalty you get from the network.
3. Try writing to a Zip Archive
There are zip archives that can contain multiple files inside it, while still allowing for fairly easy access of an individual file. This could help improve performance in a few ways:
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Less file system operations. Since you are only writing to a single file you would avoid some overhead with the file system.
Reduced overhead due to cluster size. Files have a minimum size, this can cause a fair bit of wasted space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip-archive, but then you would mostly benefit from the compression.
Responding to your question, you have an option (add a flag on the constructor) but it is strongly tied to the garbage collector, also think about multi thread environment it could be a mess. That said this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = true);
Source
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I have a web application, the requirement is we need to load millions of byte array to memory to supply these to one personal sdk method which takes argument as IEnumerable. The problem is to convert such a huge amount of files into bytes arrays(each file to byte[]).There are about 10 million such files. These takes lots of time and memory to load. so how to accomplish this task. Any suggestion would be greatly appreciated.
This is very probably not a good idea.
It's probably best to save your data in files, load a file in memory when you need it, and keep a cache of the n most recently used files. That way you can manage the amount of memory you consume and your server won't be bogged down by what you are doing.
You didn't mention how large the files are, BTW, but file systems are pretty fast now and in combination with that cache, performance will probably be acceptable. I would test this scenario before trying anything funny in-memory.
10 million files of 2 KB each is 20 gigabytes of data. Even if it was in a single file it'd take on the order of three minutes to load at the typical disk transfer speed of 100 megabytes per second. But because you're opening 10 million individual files it's going to take a lot longer.
If those 10 million files are in a single directory it's going to take even longer. NTFS does not perform well when you have that many files in a single directory.
If the files are in a single directory, I'd suggest splitting them up. You're better off having fewer than 10,000 files (and preferably fewer than 1,000) files in a single directory. Create a directory hierarchy to hold the files.
That still leaves you with having to open 10 million individual files. If the data doesn't change often, you should create a single binary file that contains the file names and the associated data. You'd have to recreate that file every time one of the constituent files changes, but you already have to restart your application if one of the files changes.
But all told, I really don't understand why you want to load all this data into memory. If your Web app is going to squirt this down the pipe to some requesting application, the data transfer time will be, at best, the same speed as reading the data from a file. So you're better off having something that reads the data from the file and streams it to the requesting application.
If your application requires that this 20 GB be in memory so that you can send it to the requesting app, then there's probably something seriously wrong with your application design.
One more thing: as I recall, IIS recycles processes from time to time. If your Web app is idle for a long period, then IIS might very well flush it from memory. So the next time somebody makes a request to your application, it will have to reload the data. If you want the data to truly be persistent, you probably want a Windows service that will load the data and keep it in memory. The Web app can query the service for the data when it needs to.
Foreseeable issues:
Performance: Sequential serialization of a large amount of files can be time consuming.
RAM: Payload total size may request critical amounts of memory.
Possible solutions:
Distribute the serialization task. Spawn worker threads, set processor affinity for each in order to evenly distribute workload. Your disk/repository I/O will probably be the bottleneck.
Implement paging. Don't try to load everything in memory. Serialize blocks on demand. As long as your serialization is faster that the required physical network bandwidth, there'll be no 'buffer underrun' situations - in this case, an empty channel waiting for server to answer. That way your process may even start replying faster than if you tried to do full serialization before starting transmitting.
Cache as much as you want, as little as you can. Don't redo costly work.
That said...
...I completely agree with Roy Dictus and the others - seems a very bad model to me.
Yesterday,I asked the question at here:how do disable disk cache in c# invoke win32 CreateFile api with FILE_FLAG_NO_BUFFERING.
In my performance test show(write and read test,1000 files and total size 220M),the FILE_FLAG_NO_BUFFERING can't help me improve performance and lower than .net default disk cache,since i try change FILE_FLAG_NO_BUFFERING to FILE_FLAG_SEQUENTIAL_SCAN can to reach the .net default disk cache and faster little.
before,i try use mongodb's gridfs feature replace the windows file system,not good(and i don't need to use distributed feature,just taste).
in my Product,the server can get a lot of the smaller files(60-100k) on per seconds through tcp/ip,then need save it to the disk,and third service read these files once(just read once and process).if i use asynchronous I/O whether can help me,whether can get best speed and best low cpu cycle?. someone can give me suggestion?or i can still use FileStream class?
update 1
the memory mapped file whether can to achieve my demand.that all files write to one big file or more and read from it?
If your PC is taking 5-10 seconds to write a 100kB file to disk, then you either have the world's oldest, slowest PC, or your code is doing something very inefficient.
Turning off disk caching will probably make things worse rather than better. With a disk cache in place, your writes will be fast, and Windows will do the slow part of flushing the data to disk later. Indeed, increasing I/O buffering usually results in significantly improved I/O in general.
You definitely want to use asynchronous writes - that means your server starts the data writing, and then goes back to responding to its clients while the OS deals with writing the data to disk in the background.
There shouldn't be any need to queue the writes (as the OS will already be doing that if disc caching is enabled), but that is something you could try if all else fails - it could potentially help by writing only one file at a time to minimise the need for disk seeks..
Generally for I/O, using larger buffers helps to increase your throughput. For example instead of writing each individual byte to the file in a loop, write a buffer-ful of data (ideally the entire file, for the sizes you mentioned) in one Write operation. This will minimise the overhead (instead of calling a write function for every byte, you call a function once for the entire file). I suspect you may be doing something like this, as it's the only way I know to reduce performance to the levels you've suggested you are getting.
Memory-mapped files will not help you. They're really best for accessing the contents of huge files.
One of buggest and significant improvements, in your case, can be, imo, process the filles without saving them to a disk and after, if you really need to store them, push them on Queue and provess it in another thread, by saving them on disk. By doing this you will immidiately get processed data you need, without losing time to save a data on disk, but also will have a file on disk after, without losing computational power of your file processor.
I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files and then move them to a central data storage server. I will have to move about 270Gb of data every 24 hours. I have a dedicated server to run this job and a LAN of 1Gbps. Currently I am reading lines from a (text)file, copying them into a buffer stream and writing them to the destination.
My last test copied about 2.5Gb of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then using a stream buffer a bit to copy. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.
You first need to profile as Umair said so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as Filezilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.
One of the solutions can be is what you mantioned: compress files in one Zip file and after transfer them via network. This will bemuch faster as you are transfering one file and often on of principal bottleneck during file transfers is Destination security checks.
So if you use one zip file, there should be one check.
In short:
Compress
Transfer
Decompress (if you need)
This already have to bring you big benefits in terms of performance.
Compress the logs at source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke this) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .Net?)
In any event, do not read your files linewise - read the files in blocks (loop doing FileStream.Read for 4K - say - bytes until read count == 0) and send that direct to the network pipe.
Trying profiling your program... bottleneck is often where you least expect it to be. As some clever guy said "Premature optimisation is the root of all evil".
Once in a similar scenario at work, I was given the task to optimise the process. And after profiling the bottleneck was found to be a call to sleep function (which was used for synchronisation between thread!!!! ).
I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100mb+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine, and parsing it there. That leads to issue 2. Since the files are quite large copying it locally can take quite a while.
To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.
The two main issues are:
1) How do i speed up the file transfer (Compression?, Is transferring locally even neccessary?, Can I read an in use file some other way?)
2) How do i deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didnt do the obvious parse on the server reason is that I want to have as little cpu impact as possible. I don't want to affect the performance of the system im testing.
If you are reading a sequential file you want to read it in line by line over the network. You need a transfer method capable of streaming. You'll need to review your IO streaming technology to figure this out.
Large IO operations like this won't benefit much by multithreading since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.
The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.
The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.
I guess it depends on how "remote" it is. 100MB on a 100Mb LAN would be about 8 secs...up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards, and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.
Use compression for transfer.
If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up, you just have to do it in a smart way -- have a deterministic algorithm for which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of a middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.
If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.
I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
i think using compression (deflate/gzip) would help
The given answer do not satisfy me and maybe my answer will help others to not think it is super complicated or multithreading wouldn't benefit in such a scenario. Maybe it will not make the transfer faster but depending on the complexity of your parsing it may make the parsing/or analysis of the parsed data faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Are these information like statistics or are they dependent on multiple log message?
You have several options:
parse multiple files at the same would be the easiest I guess, you have the file as context and can create one thread per file
another option as mentioned before is use compression for the network communication
you could also use a helper that splits the log file into lines that belong together as a first step and then with multiple threads process these blocks of lines; the parsing of this depend lines should be quite easy and fast.
Very important in such a scenario is to measure were your actual bottleneck is. If your bottleneck is the network you wont benefit of optimizing the parser too much. If your parser creates a lot of objects of the same kind you could use the ObjectPool pattern and create objects with multiple threads. Try to process the input without allocating too much new strings. Often parsers are written by using a lot of string.Split and so forth, that is not really as fast as it could be. You could navigate the Stream by checking the coming values without reading the complete string and splitting it again but directly fill the objects you will need after parsing is done.
Optimization is almost always possible, the question is how much you get out for how much input and how critical your scenario is.