Optimize large CSV file writer - large table DataReader using TPL/PLINQ - C#

Any tips on how I can optimize the code below further using TPL and/or PLINQ?
The code below runs on a background worker:
open a SqlDataReader over the large table
open a StreamWriter for the large output CSV file
while (reader.Read())
{
    massage the data, parse data from columns etc.
    build the CSV line for this row
    write the CSV line to the file
}
close the reader
close the file
Thank you.

You might find better performance by writing the CSV line data to a StringBuilder (i.e., in memory) and then writing the contents out to your CSV file in one go. I would suggest trying both methods along with a memory profiler like ANTS or the JetBrains product.
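A minimal sketch of that idea (using System.Text and System.IO), assuming a SqlDataReader named reader is already open and outputPath is the CSV file to produce; both names and the column layout are placeholders:
var sb = new StringBuilder();
while (reader.Read())
{
    // Build one CSV line per row; adjust columns and formatting to your schema.
    sb.Append(reader.GetInt32(0)).Append(',')
      .Append(reader.GetString(1)).Append(',')
      .AppendLine(reader.GetDateTime(2).ToString("yyyy-MM-dd"));
}
File.WriteAllText(outputPath, sb.ToString());   // one big write instead of many small ones
Note that this keeps the whole file in memory, so it only helps if the result fits comfortably in RAM; flushing the StringBuilder to the file every N rows is a middle ground.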

define "optimize further"... Do you want more speed or less memory use?
Assuming above pseudo code is correctly implemented then memory use should be already pretty minimal.
Speed? Based on the statement that that you are working with a large data set, then the data reader would be your biggest source of slowness. So if you really wanted to use parallel processing, then you'd have to fragment your data set (presumably open multiple readers?)
But then again, you are already running it in a background worker, so does it really matter?
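If you did go down the fragmentation route, a rough sketch might look like this (using System.Data.SqlClient; connectionString, MyTable and the Id column are placeholders, and it assumes the table has a numeric key you can range-partition on):
var ranges = new[] { (0, 250_000), (250_000, 500_000), (500_000, 750_000), (750_000, 1_000_000) };
Parallel.ForEach(ranges, range =>
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT * FROM MyTable WHERE Id >= @lo AND Id < @hi", conn))
    using (var writer = new StreamWriter($"output_{range.Item1}.csv"))
    {
        cmd.Parameters.AddWithValue("@lo", range.Item1);
        cmd.Parameters.AddWithValue("@hi", range.Item2);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // massage/format the columns here, then write one CSV line
                writer.WriteLine(string.Join(",", reader.GetValue(0), reader.GetValue(1)));
            }
        }
    }
});
You would then concatenate the partial files afterwards, and it needs benchmarking - on a single spinning disk the extra seeking can easily make this slower than the plain sequential loop.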


C# Reuse StreamWriter or FileStream but change destination file

A little background...
Everything I'm about to describe, up to my implementation of the StreamWriter, is a business process which I cannot change.
Every month I pull around 200 different tables of data into individual files.
Each file contains roughly 400,000 lines of business logic details for upwards of 5,000-6,000 different business units.
To effectively use that data with the tools on hand, I have to break down those files into individual files for each business unit...
200 files x 5000 business units per file = 100,000 different files.
The way I've BEEN doing it is the typical StreamWriter loop...
foreach (string SplitFile in BusinessFiles)   // BusinessFiles: a List<string> of output file paths
{
    using (StreamWriter SW = new StreamWriter(SplitFile))
    {
        foreach (var BL in g)                 // g: the group of lines belonging to this business unit (grouping not shown)
        {
            string[] Split1 = BL.Split(',');
            SW.WriteLine("{0,-8}{1,-8}{2,-8}{3,-8}{4,-8}{5,-8}{6,-8}{7,-8}{8,-8}{9,-16}{10,-1}",
                Split1[0], Split1[1], Split1[2], Split1[3], Split1[4], Split1[5], Split1[6], Split1[7],
                Split1[8], Convert.ToDateTime(Split1[9]).ToString("dd-MMM-yyyy"), Split1[10]);
        }
    }
}
The issue with this is that it takes an excessive amount of time.
It can take 20 minutes to process all the files sometimes.
Profiling my code shows me that 98% of the time is spent on the system disposing of the StreamWriter after the program leaves the loop.
So my question is......
Is there a way to keep the underlying Stream open and reuse it to write a different file?
I know I can Flush() the Stream, but I can't figure out how to get it to start writing to another file altogether. I can't seem to find a way to change the destination filename without having to create another StreamWriter.
Edit:
A picture of what it shows when I profile the code
Edit 2:
So after poking around a bit more I started looking at it a different way.
First thing is, I already had the reading of the one file and the writing of the massive number of smaller files in a nested parallel loop, so I was essentially maxing out my I/O as it is.
I'm also writing to an SSD, so all those were good points.
Turns out I'm reading the 1 massive file and writing ~5600 smaller ones every 90 seconds or so.
That's 60 files a second. I guess I can't really ask for much more than that.
This sounds about right. 100,000 files in 20 minutes is more than 83 files every second. Disk I/O is pretty much the slowest thing you can do within a single computer. All that time in the Dispose() method is waiting for the buffer to flush out to disk while closing the file... it's the actual time writing the data to your persistent storage, and a separate using block for each file is the right way to make sure this is done safely.
To speed this up it's tempting to look at asynchronous processing (async/await), but I don't think you'll find any gains there; ultimately this is an I/O-bound task, so optimizing for your CPU scheduling might even make things worse. Better gains could be available if you can change the output to write into a single (indexed) file, so the operating system's disk buffering mechanism can be more efficient.
I would agree with Joel that the time is mostly due to writing the data out to disk. I would however be a little bit more optimistic about doing parallel IO, since SSDs are better able to handle higher loads than regular HDDs. So I would try a few things:
1. Doing stuff in parallel
Change your outer loop to a parallel one
Parallel.ForEach(
    myBusinessFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 2 },
    SplitFile =>
    {
        // Loop body
    });
Try changing the degree of parallelism to see if performance improves or not. This assumes the data is thread safe.
2. Try writing to a high-speed local SSD
I'm assuming you are writing to a network folder; this adds some additional latency, so you might try writing to a local disk. If you are already doing that, consider getting a faster disk. If you need to move all the files to a network drive afterwards, you will likely not gain anything overall, but it will give you an idea of the penalty the network imposes.
3. Try writing to a Zip Archive
A zip archive can contain multiple files while still allowing fairly easy access to an individual file. This could help improve performance in a few ways:
Compression. I would assume your data is fairly easy to compress, so you would write less data overall.
Less file system operations. Since you are only writing to a single file you would avoid some overhead with the file system.
Reduced overhead due to cluster size. A file occupies at least one cluster on disk, which can waste a fair bit of space for small files. Using an archive avoids this.
You could also try saving each file in an individual zip archive, but then you would mostly benefit only from the compression.
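For option 3, a rough sketch using ZipArchive from System.IO.Compression; the names businessFileNames and linesForFile are placeholders for however you group your data:
using (var zipStream = new FileStream("BusinessUnits.zip", FileMode.Create))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
{
    foreach (string splitFile in businessFileNames)
    {
        var entry = archive.CreateEntry(splitFile, CompressionLevel.Fastest);
        using (var sw = new StreamWriter(entry.Open()))
        {
            foreach (var line in linesForFile[splitFile])
                sw.WriteLine(line);
        }
    }
}
Disposing each StreamWriter closes its entry, so the per-unit "files" are written one after another into a single archive stream instead of 100,000 separate files.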
Responding to your question: you do have an option (a flag on the constructor), but it is strongly tied to how the underlying stream gets cleaned up, and in a multi-threaded environment it could be a mess. That said, this is the overloaded constructor:
StreamWriter(Stream, Encoding, Int32, Boolean)
Initializes a new instance of the StreamWriter class for the specified stream by using the specified encoding and buffer size, and optionally leaves the stream open.
public StreamWriter (System.IO.Stream stream, System.Text.Encoding? encoding = default, int bufferSize = -1, bool leaveOpen = false);
Source
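For completeness, a small example of what the leaveOpen flag buys you: disposing the writer flushes it but leaves the underlying stream open. It does not, however, let one writer switch to a different destination file. The path here is illustrative only:
using (var fs = new FileStream("output.dat", FileMode.Create))
{
    using (var sw = new StreamWriter(fs, Encoding.UTF8, bufferSize: 4096, leaveOpen: true))
    {
        sw.WriteLine("first block");
    }                       // sw is disposed and flushed, fs is still open
    fs.WriteByte(0x0A);     // the stream can still be used directly
}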

Writing continuously to file vs saving data in an array and writing to file all at once

I am working in C# and in my program I am currently opening a file stream at launch and writing data to it in CSV format about every second. I am wondering, would it be more efficient to store this data in an ArrayList and write it all at once at the end, or to keep the file stream open and just write the data every second?
If the amount of data is "reasonably manageable" in memory, then write the data at the end.
If this is continuous, I wonder if an option could be to use something like NLog to write your CSV (create a specific log format), as it manages writes pretty efficiently. You would also need to set it to raise exceptions if there was an error.
You should consider using a BufferedStream instead. Write to the stream and allow the framework to flush to file as necessary. Just make sure to flush the stream before closing it.
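A quick sketch of that wrapping, with an illustrative path and buffer size; in the real program you would keep these objects open for the application's lifetime and call Flush() whenever you need the data on disk:
using (var fs = new FileStream("log.csv", FileMode.Append, FileAccess.Write))
using (var bs = new BufferedStream(fs, 64 * 1024))       // 64 KB buffer
using (var writer = new StreamWriter(bs))
{
    writer.WriteLine("2024-01-01T00:00:00,42");          // one CSV record, written roughly every second
}   // disposing the writer flushes the BufferedStream and the FileStream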
From what I learned in Operating Systems, writing to a file is a lot more expensive than writing to memory. However, your stream is most likely going to be cached, which means that under the hood all the file writing you are doing is happening in memory. The operating system handles the actual writing to disk asynchronously when it's the right time. Depending on your application, there may be no need to worry about such micro-optimizations.
You can read more about why most languages take this approach under the hood here https://unix.stackexchange.com/questions/224415/whats-the-philosophy-behind-delaying-writing-data-to-disk
This kind of depends on your specific case. If you're writing data about once per second it seems likely that you're not going to see much of an impact from writing directly.
In general writing to a FileStream in small pieces is quite performant because the .NET Framework and the OS handle buffering for you. You won't see the file itself being updated until the buffer fills up or you explicitly flush the stream.
Buffering in memory isn't a terrible idea for smallish data and shortish periods. Of course if your program throws an exception or someone kills it before it writes to the disk then you lose all that information, which is probably not your favourite thing.
If you're worried about performance then use a logging thread. Post objects to it through a ConcurrentQueue<> or similar and have it do all the writes on a separate thread. Obviously threaded logging is more complex. It's not something I'd advise unless you really, really need the extra performance.
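A rough sketch of that pattern using a BlockingCollection (which wraps a ConcurrentQueue by default, from System.Collections.Concurrent) and a dedicated writer task; the path and record format are placeholders:
var queue = new BlockingCollection<string>();
var writerTask = Task.Run(() =>
{
    using (var writer = new StreamWriter("data.csv", append: true))
    {
        foreach (string line in queue.GetConsumingEnumerable())
            writer.WriteLine(line);
    }
});
// Producer side, called from the main code roughly once per second:
queue.Add("2024-01-01T00:00:00,42");
// On shutdown: stop accepting new items and let the writer drain the queue.
queue.CompleteAdding();
writerTask.Wait();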
For quick-and-dirty logging I generally just use File.AppendAllText() or File.AppendAllLines() to push the data out. It takes a bit longer, but it's pretty reliable. And I can read the output while the program is still running, which is often useful.

Manipulating large set of CSV data in memory

I am trying to manipulate a large set (10 million records) of data that I have imported into a DataTable. I don't think the DataTable is the most efficient way of manipulating a large set of data in memory. Does anyone have a better way of doing this? What I am trying to do is take the contents of a CSV file, manipulate some of the data, and re-export the results into another CSV file.
TIA,
Paul
A DataTable will require loading the whole thing into memory at once. Don't do that. Instead treat both the in and out csv files as streams. Here's a really good CSV reader that will allow you to read and work on one record at a time:
A Fast CSV Reader
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
You want to open the source for reading and the destination for writing at the same time. As you read a line from the source, process it, and then write to the destination. You should never have more than a line or a few lines in memory. This will be far more efficient both in terms of memory usage and performance.
For higher performance you could use separate reading/writing threads and a producer/consumer queue, but that takes a lot more management to ensure the queue doesn't get saturated and depending on the situation and relative read/process/write performance, this more complex solution may not increase performance at all.
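A minimal sketch of that read-one-line, process, write-one-line pattern; the paths and the transformation are only illustrative:
using (var reader = new StreamReader("input.csv"))
using (var writer = new StreamWriter("output.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] fields = line.Split(',');
        fields[2] = fields[2].ToUpperInvariant();        // example manipulation of one column
        writer.WriteLine(string.Join(",", fields));
    }
}
The naive Split(',') is only safe if your fields never contain quoted commas; the linked Fast CSV Reader handles the full CSV format.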

Fast CSV reader

I am working on a project which loads data from a CSV file, processes it and then saves it to disk. For fast reading of the CSV data I am using the Lumenworks CSV reader http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader. This works fine up to a point, but when the CSV is 1GB or more in size it takes a long time. Is there any other way to read the CSV faster?
Not a lot of info provided... so on the assumption that this is an IO limitation your options are:
Get Faster Storage [e.g. SSD, RAID].
Try compression - sometimes the time spent compressing [e.g. Zip] saves multiples in IO (see the sketch after this list).
Try threading - particularly useful if doing computationally hard calculations - but probably a bad fit in this scenario.
Change the problem - do you need to read/write a 1GB file? Maybe you can change the data format (a binary 156 is a lot smaller than the text "156,"), maybe you only need to deal with smaller blocks, maybe the time taken honestly doesn't matter, etc.
Any others?
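On the compression point, a hedged sketch with GZipStream from System.IO.Compression; it assumes you also control how the file is produced, so it can be stored as .gz in the first place (the path is illustrative):
using (var fs = File.OpenRead("data.csv.gz"))
using (var gz = new GZipStream(fs, CompressionMode.Decompress))
using (var reader = new StreamReader(gz))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // parse and process the CSV record here
    }
}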
Hmm, you could try caching the output; I've experimented with MemoryMappedFiles and RAM drives. You could do it with some simple threading, and while this does potentially return sooner, it has huge risks and complexities.

Fast writing bytearray to file

I use _FileStream.Write(_ByteArray, 0, _ByteArray.Length); to write a byte array to a file. I noticed that it's very slow.
I read a line from a text file, convert it to a byte array and then need to write it to a new (large, > 500 MB) file. Any advice on speeding up the write process would be appreciated.
FileStream.Write is basically what there is. It's possible that using a BufferedStream would help, but unlikely.
If you're really reading a single line of text which, when encoded, is 500MB then I wouldn't be surprised to find that most of the time is being spent performing encoding. You should be able to test that by doing the encoding and then throwing away the result.
Assuming the "encoding" you're performing is just Encoding.GetBytes(string), you might want to try using a StreamWriter to wrap the FileStream - it may work better by tricks like repeatedly encoding into the same array before writing to the file.
If you're actually reading a line at a time and appending that to the file, then:
Obviously it's best if you keep both the input stream and output stream open throughout the operation. Don't repeatedly read and then write.
You may get better performance using multiple threads or possibly asynchronous IO. This will partly depend on whether you're reading from and writing to the same drive.
Using a StreamWriter is probably still a good idea.
Additionally when creating the file, you may want to look at using a constructor which accepts a FileOptions. Experiment with the available options, but I suspect you'll want SequentialScan and possibly WriteThrough.
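A sketch of that constructor; the particular flag combination here is just the one suggested above and is worth benchmarking rather than assuming:
using (var fs = new FileStream(
    "output.dat",
    FileMode.Create,
    FileAccess.Write,
    FileShare.None,
    64 * 1024,                                           // buffer size
    FileOptions.SequentialScan | FileOptions.WriteThrough))
using (var writer = new StreamWriter(fs))
{
    writer.WriteLine("...");                             // write the converted lines here
}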
If you're writing nothing but byte arrays, have you tried using BinaryWriter's Write method? Writing in bulk would probably also help with the speed. Perhaps you can read each line, convert the string to its bytes, store those bytes for a future write operation (i.e. in a List or something), and every so often (after reading x lines) write a chunk to the disk.
BinaryWriter: http://msdn.microsoft.com/en-us/library/ms143302.aspx
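A rough sketch of that batching idea (the file names, the 1000-line chunk size and the UTF-8 encoding are all illustrative):
using (var reader = new StreamReader("input.txt"))
using (var writer = new BinaryWriter(File.Open("output.dat", FileMode.Create)))
{
    var buffer = new List<byte>();
    string line;
    int lineCount = 0;
    while ((line = reader.ReadLine()) != null)
    {
        buffer.AddRange(Encoding.UTF8.GetBytes(line + Environment.NewLine));
        if (++lineCount % 1000 == 0)             // flush a chunk every 1000 lines
        {
            writer.Write(buffer.ToArray());
            buffer.Clear();
        }
    }
    writer.Write(buffer.ToArray());              // write whatever is left over
}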
