How to efficiently write a large text file in C#? - c#

I am creating a method in C# which generates a text file for a Google Product Feed. The feed will contain upwards of 30,000 records and the text file currently weighs in at ~7Mb.
Here's the code I am currently using (some lines removed for brevity's sake).
public static void GenerateTextFile(string filePath) {
    var sb = new StringBuilder(1000);
    sb.Append("availability").Append("\t");
    sb.Append("condition").Append("\t");
    sb.Append("description").Append("\t");
    // repetitive code hidden for brevity ...
    sb.Append(Environment.NewLine);

    var items = inventoryRepo.GetItemsForSale();
    foreach (var p in items) {
        sb.Append("in stock").Append("\t");
        sb.Append("used").Append("\t");
        sb.Append(p.Description).Append("\t");
        // repetitive code hidden for brevity ...
        sb.AppendLine();
    }

    using (StreamWriter outfile = new StreamWriter(filePath)) {
        Console.WriteLine("Writing text file to disk.");
        outfile.Write(sb.ToString());
    }
}
I am wondering if StringBuilder is the right tool for the job. Would there be performance gains if I used a TextWriter instead?
I don't know a ton about IO performance so any help or general improvements would be appreciated. Thanks.

File I/O operations are generally well optimized in modern operating systems. You shouldn't try to assemble the entire string for the file in memory ... just write it out piece by piece. The FileStream will take care of buffering and other performance considerations.
You can make this change easily by moving:
using (StreamWriter outfile = new StreamWriter(filePath)) {
to the top of the function, getting rid of the StringBuilder and writing directly to the file instead.
There are several reasons why you should avoid building up large strings in memory:
It can actually perform worse, because the StringBuilder has to increase its capacity as you write to it, resulting in reallocation and copying of memory.
It may require more memory than you can physically allocate - which may result in the use of virtual memory (the swap file) which is much slower than RAM.
For truly large files (> 2 GB) you will run out of address space (on 32-bit platforms) and will fail to ever complete.
To write the StringBuilder contents to a file you have to use ToString() which effectively doubles the memory consumption of the process since both copies must be in memory for a period of time. This operation may also fail if your address space is sufficiently fragmented, such that a single contiguous block of memory cannot be allocated.
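Applied to the method in the question, the reworked version might look like this. It is a sketch: the `Item` class and the trimmed column list stand in for the repository type and full columns, which aren't shown in the original.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Item { public string Description; }

static class FeedWriter
{
    // Stream each row straight to disk instead of buffering the whole feed in memory.
    public static void GenerateTextFile(string filePath, IEnumerable<Item> items)
    {
        using (var outfile = new StreamWriter(filePath))
        {
            // header row (remaining columns elided, as in the question)
            outfile.Write("availability\t");
            outfile.Write("condition\t");
            outfile.WriteLine("description");

            foreach (var p in items)
            {
                outfile.Write("in stock\t");
                outfile.Write("used\t");
                outfile.WriteLine(p.Description);
            }
        }
    }
}
```

The StreamWriter's internal buffer batches the many small writes, so the per-call overhead stays low even though the code no longer accumulates a single giant string.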

Just move the using statement so it encompasses the whole of your code, and write directly to the file. I see no point in keeping it all in memory first.

Write one string at a time using StreamWriter.Write rather than caching everything in a StringBuilder.

This thread might be old, but I had a file with about 17 million lines to write, so I ended up batching the writes every 10,000 lines, along these lines:
for (i6 = 1; i6 <= ball; i6++)
{   // this is the middle of a 6-deep nest ...
    counter++;
    // build the string of fields with \n at the end
    lineout = lineout + whatever;

    // flush a block every 10,000 lines (remainder operator %)
    if (counter % 10000 == 0)
    {
        using (StreamWriter outFile = new StreamWriter(filepath, true)) // true = append
        {
            // write the 10k lines with Write, NOT WriteLine
            outFile.Write(lineout);
        }
        // reset the string so memory use stays bounded
        lineout = "";
    }
}
In my case this was MUCH faster than writing one line at a time.

Related

Performance issues while creating file checksums

I am writing a console application which iterates through a binary tree and searches for new or changed files based on their md5 checksums.
The whole process is acceptably fast (14 sec for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...
Any suggestions for improving this process? My hash function is the following:
private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}
Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. (It is valid for some of its other points, however.)
The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:
Do not read the entire file into memory for the calculation; that is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and compute the hash in chunks.
The reason you see a slowdown when using the ComputeHash stream override is that it internally uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; the optimal value is to be found by experimenting).
Use the TransformBlock and TransformFinalBlock methods to calculate the hash value. You can pass null for the outputBuffer parameter.
Reuse that buffer for subsequent files' hash calculations, so there is no need for additional allocations.
Additionally, you can reuse the MD5CryptoServiceProvider instance, but the benefits are questionable.
And lastly, you can apply an async pattern for reading chunks from the stream, so the OS reads the next chunk from disk at the same time as you calculate the partial hash of the previous chunk. Such code is more difficult to write, and you'll need at least two buffers (reuse them as well), but it can have a great impact on speed.
As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance that a file is deleted in the meantime.
All of the above applies to medium and large files. If you instead have a lot of very small files, you can speed up the calculation by processing files in parallel. Parallelization can actually also help with large files, but that has to be measured.
And finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, CRC for example.
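A sketch of the chunked approach described above. The 256 KB buffer size and lowercase hex formatting are illustrative choices, not requirements; tune the chunk size by measuring.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Hasher
{
    // Reused across calls so hashing many files doesn't allocate a new buffer each time.
    private static readonly byte[] buffer = new byte[256 * 1024]; // 256 KB chunks

    public static string GetMD5(string filename)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filename))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // outputBuffer may be null when we don't need the transformed copy
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            md5.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
            return BitConverter.ToString(md5.Hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Because the file is never materialized as one byte array, memory use stays constant regardless of file size.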
In order to create the hash, you have to read every last byte of the file. So this operation is disk-limited, not CPU-limited, and scales proportionally to the size of the files. Multithreading will not help.
Unless the FS can somehow calculate and store the hash for you, there is just no way to speed this up. You are dependent on what the FS does for you to track changes.
Generally, programs that check for "changed files" (like backup routines) do not calculate the hash value for exactly that reason. They may still calculate and store it for validation purposes, but that is it.
Unless the user does some serious (NTFS-driver-loading-level) sabotage, the "last changed" date together with the file size is enough to detect changes. Maybe also check the archive bit, but that one is rarely used nowadays.
A minor improvement for these kinds of scenarios (list files and process them) is using EnumerateFiles rather than listing all files up front. But at 14 seconds of listing versus 5 minutes of processing, that will not have any relevant effect.
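That last-changed-date-plus-size check could be sketched like this; `FileSnapshot` is a made-up type standing in for whatever metadata the application stored on the previous run:

```csharp
using System;
using System.IO;

// Hypothetical stored snapshot of a file's metadata from a previous run.
class FileSnapshot
{
    public long Length;
    public DateTime LastWriteTimeUtc;
}

static class ChangeDetector
{
    // A file is considered changed if its size or last-write time differs from the snapshot.
    public static bool HasChanged(string path, FileSnapshot previous)
    {
        var info = new FileInfo(path);
        return info.Length != previous.Length
            || info.LastWriteTimeUtc != previous.LastWriteTimeUtc;
    }
}
```

Comparing two cached fields is effectively free next to re-hashing file contents, which is the point the answer is making.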

Understanding VS performance analysis

I have a C# assembly which processes retail promotions. It is able to process a promotion that has 1,288 qualifying products in just 7 seconds. However, where it is tasked to process a promotion with a larger number of qualifying products then the time taken increases exponentially in relation to the number of products. For example, a promo with 29,962 products takes 7 mins 7 secs and a promo with 77,350 products takes 39 mins and 7 secs.
I've been trying to identify if there's code in the assembly that can be easily optimized. I set the assembly processing the largest of the promotions then attached the performance analyzer to the containing process (BizTalk host instance), the resulted in the following report:
This suggests that the function taking the greatest amount of time is "GetDataPromoLines". This function contains simple string formatting. It is called from the following loop of the function "MapForFF":
foreach (var promoLine in promoLineChunk.PromoLines)
{
    outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
    promoLineNumber++;
}
The promoLineChunk.PromoLines is a List of a class which describes the promotion; it contains only private strings, one for each column of the database table from which the promotion details were selected. The content of the "GetDataPromoLines" function can be seen below:
private string GetDataPromoLines(VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
    StringBuilder sb = new StringBuilder();
    string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
    string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;

    sb.AppendLine(string.Format("PromoDiscount,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\"",
        uniqueNumber,
        promoLine.CIMS_PROMO_NAME,
        promoLine.TYPE,
        promoLine.DESCRIPTION_,
        promoLine.DISCOUNTLEVEL,
        promoLine.COUPONNUMBERMIN,
        promoLine.COUPONNUMBERMAX,
        promoLine.COUPONNUMBERLENGTH
    ));
    sb.AppendLine(string.Format("ItemReq,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
        "00001",
        promoLine.IDENTITYTYPE,
        promoLine.ITEMNUM,
        promoLine.DIVISIONNUM,
        promoLine.DEPARTMENTNUM,
        promoLine.DEPTGROUPNUM,
        promoLine.CLASSNUM,
        promoLine.ITEMGROUPNUM,
        promoLine.IR_QUANTITY
    ));
    sb.AppendLine(string.Format("TierDefinition,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
        "00001",
        promoLine.THRESHOLDTYPE,
        promoLine.THRESHOLDQTY,
        promoLine.THRESHOLDAMT,
        promoLine.DISCTYPE,
        promoLine.DISCPCT,
        promoLine.DISCAMT,
        promoLine.DISCAPPLIESTO,
        promoLine.DISCQTY,
        promoLine.ADDLINFO
    ));
    return sb.ToString();
}
Can anyone suggest what is causing the exponential increase in time to process? Is it something to do with CLR unboxing?
outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
Is that an attempt to build an entire output file by appending strings? There's your Schlemiel the Painter's algorithm.
For cases like this, you really want to use StringBuilder (or even better, output directly into a file stream using StreamWriter or something):
StringBuilder outputFile = new StringBuilder();
foreach (var promoLine in promoLineChunk.PromoLines)
{
    outputFile.Append(GetDataPromoLines(promoLine, promoLineNumber+1));
    promoLineNumber++;
}
The problem with simple appends is that string is immutable in .NET - every time you modify it, it is copied over. For things like outputting huge text files, this is incredibly costly, of course - you spend most of your time copying the parts of the string that didn't change.
In the same way, don't do sb.AppendLine(string.Format(...)); simply use sb.AppendFormat. Ideally, pass the StringBuilder as an argument to avoid having to copy over the lines themselves, although that should be a relatively insignificant performance hit next to the outputFile += ....
As a side note, be careful when interpreting the results of profiling; they are often subtly misleading. In your case, I'm pretty certain your problem is not in GetDataPromoLines itself (although even that could be improved, as seen above), but in the outputFile += .... It's not enough to just look at the function with the highest exclusive samples. It's also not enough to just look at the hot path, although that's already a huge step up that usually leads you straight to where your attention is needed. Also, understand the difference between sampling and instrumentation: sampling can lead you to try optimizing a method that's not really a performance problem on its own; rather, it simply shouldn't be called as often as it is. Do not use profiler results as blindfolds - you still need to pay attention to what actually makes sense.
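A sketch of passing the StringBuilder down instead of returning strings. The `PromoLine` type here is reduced to two of the columns from the question, purely for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class PromoLine
{
    public string CIMS_PROMO_NUMBER;
    public string CIMS_PROMO_NAME;
}

static class PromoFormatter
{
    // Appends directly into the caller's StringBuilder: no per-line string copies.
    static void AppendDataPromoLines(StringBuilder sb, PromoLine promoLine, int sequenceNumber)
    {
        string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
        sb.AppendFormat("PromoDiscount,\"{0}{1}\",\"{2}\"",
            promoLine.CIMS_PROMO_NUMBER, seqNum, promoLine.CIMS_PROMO_NAME);
        sb.AppendLine();
        // ... the ItemReq and TierDefinition rows would be appended the same way ...
    }

    public static string BuildOutput(IEnumerable<PromoLine> promoLines)
    {
        var sb = new StringBuilder();
        int promoLineNumber = 0;
        foreach (var promoLine in promoLines)
        {
            AppendDataPromoLines(sb, promoLine, promoLineNumber + 1);
            promoLineNumber++;
        }
        return sb.ToString(); // one final copy instead of one per line
    }
}
```

The total work is now linear in the output size, instead of quadratic as with repeated string concatenation.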

What consumes less resources and is faster File.AppendText or File.WriteAllText storing first in StringBuilder?

I have to write thousands of dynamically generated lines to a text file.
I have two choices. Which consumes fewer resources and is faster than the other?
A. Using StringBuilder and File.WriteAllText
StringBuilder sb = new StringBuilder();
foreach (Data dataItem in Datas)
{
    sb.AppendLine(
        String.Format(
            "{0}, {1}-{2}",
            dataItem.Property1,
            dataItem.Property2,
            dataItem.Property3));
}
File.WriteAllText("C:\\example.txt", sb.ToString(), new UTF8Encoding(false));
B. Using File.AppendText
using (StreamWriter sw = File.AppendText("C:\\example.txt"))
{
    foreach (Data dataItem in Datas)
    {
        sw.WriteLine(
            String.Format(
                "{0}, {1}-{2}",
                dataItem.Property1,
                dataItem.Property2,
                dataItem.Property3));
    }
}
Your first version, which puts everything into a StringBuilder and then writes it, will consume the most memory. If the text is very large, you have the potential of running out of memory. It has the potential to be faster, but it could also be slower.
The second option will use much less memory (basically, just the StreamWriter buffer), and will perform very well. I would recommend this option. It performs well--possibly better than the first method--and doesn't have the same potential for running out of memory.
You can speed it up quite a lot by increasing the size of the output buffer. Rather than
File.AppendText("filename")
Create the stream with:
const int BufferSize = 65536; // 64 Kilobytes
StreamWriter sw = new StreamWriter("filename", true, Encoding.UTF8, BufferSize);
A buffer size of 64K gives much better performance than the default 4K buffer size. You can go larger, but I've found that larger than 64K gives minimal performance gains, and on some systems can actually decrease performance.
You do have at least one other choice: using File.AppendAllLines().
var data = from item in Datas
select string.Format("{0}, {1}-{2}", item.Property1, item.Property2, item.Property3);
File.AppendAllLines("Filename", data, new UTF8Encoding(false));
This will theoretically use less memory than your first approach since only one line at a time will be buffered in memory.
It will probably be almost exactly the same as your second approach though. I'm just showing you a third alternative. The only advantage of this one is that you can feed it a Linq sequence, which can be useful sometimes.
The I/O speed will dwarf any other considerations, so you should concentrate on minimising memory usage as juharr noted above (and also considering the dangers of premature optimisation, of course!)
That means using your second approach, or the one I put here.

When should I slurp a file, and when should I read it by-line?

Imagine that I have a C# application that edits text files. The technique employed for each file can be either:
1) Read the file at once into a string, make the changes, and write the string over the existing file:
string fileContents = File.ReadAllText(fileName);
// make changes to fileContents here...
using (StreamWriter writer = new StreamWriter(fileName))
{
writer.Write(fileContents);
}
2) Read the file by line, writing the changes to a temp file, then deleting the source and renaming the temp file:
using (StreamReader reader = new StreamReader(fileName))
using (StreamWriter writer = new StreamWriter(fileName + ".tmp"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // make changes to line here
        writer.WriteLine(line);
    }
}
File.Delete(fileName);
File.Move(fileName + ".tmp", fileName);
What are the performance considerations with these options?
It seems to me that whether reading by line or reading the entire file at once, the same quantity of data will be read, and disk times will dominate the memory-allocation times. That said, once a file is in memory, the OS is free to page it back out, and when it does so the benefit of that large read has been lost. On the other hand, when working with a temporary file, once the handles are closed I need to delete the old file and rename the temp file, which incurs a cost.
Then there are questions around caching, and prefetching, and disk buffer sizes...
I am assuming that in some cases, slurping the file is better, and in others, operating by line is better. My question is, what are the conditions for these two cases?
in some cases, slurping the file is better, and in others, operating by line is better.
Very nearly; except that reading line-by-line is actually a much more specific case. The actual choices we want to distinguish between are ReadAll and using a buffer. ReadLine makes assumptions - the biggest one being that the file actually has lines, and they are a reasonable length! If we can't make this assumption about the file, we want to choose a specific buffer size and read into that, regardless of whether we've reached the end of a line or not.
So deciding between reading it all at once and using a buffer - always go with the easiest to implement, and most naive approach until you run into a specific situation that does not work for you - and having a concrete case, you can make an educated decision based on the information you actually have, rather than speculating about hypothetical situations.
Simplest - read it all at once.
Is performance becoming a problem? Does this application run against uncontrolled files, so that their size is not predictable? These are just a few examples of when you want to chunk it.
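When the "file has lines" assumption doesn't hold, the buffered variant reads into a fixed char array. The 64 KB size here is just a plausible starting point, not a recommendation from the original answer:

```csharp
using System;
using System.IO;

static class ChunkReader
{
    // Process a file in fixed-size chunks, independent of any line structure.
    public static long CountChars(string path)
    {
        var buffer = new char[64 * 1024];
        long total = 0;
        using (var reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // operate on buffer[0..read) here; this sketch just counts characters
                total += read;
            }
        }
        return total;
    }
}
```

Memory use is bounded by the buffer size no matter how large the file is, and no line of any length can blow it up.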

StringDictionary To TextFile

I am trying to export a StringDictionary to a text file. It has over one million records, and it takes over 3 minutes to export into a text file if I use a loop.
Is there a way to do that faster?
Regards
Well, it depends on what format you're using for the export, but in general, the biggest overhead for exporting large amounts of data is going to be I/O. You can reduce this by using a more compact data format, and by doing less manipulation of the data in memory (to avoid memory copies) if possible.
The first thing to check is to look at your disk I/O speed and do some profiling of the code that does the writing.
If you're maxing out your disk I/O (e.g., writing at a good percentage of disk speed, which would be many tens of megabytes per second on a modern system), you could consider compressing the data before you write it. This uses more CPU, but you write less to the disk when you do this. This will also likely increase the speed of reading the file, if you have the same bottleneck on the reading side.
If you're maxing out your CPU, you need to do less processing work on the data before writing it. If you're using a serialization library, for example, avoiding that and switching to a simpler, more specialized data format might help. Consider the simplest format you need: probably just a word for the length of the string, followed by the string data itself, repeated for every key and value.
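As a sketch of the compression idea above (the tab-separated layout and compression settings are illustrative assumptions, not part of the original answer):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class CompressedExporter
{
    // Trade some CPU for less disk I/O by gzipping the export as it is written.
    public static void Write(string path, IEnumerable<KeyValuePair<string, string>> pairs)
    {
        using (var file = File.Create(path))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        using (var writer = new StreamWriter(gzip))
        {
            foreach (var pair in pairs)
            {
                writer.Write(pair.Key);
                writer.Write('\t');
                writer.WriteLine(pair.Value);
            }
        }
    }
}
```

Reading the file back goes through the same stack with CompressionMode.Decompress, so the consumer also benefits from the smaller on-disk size.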
Note that most dictionary constructs don't preserve insertion order, which often makes them poor choices if you want repeatable file contents, but (depending on the size) we may be able to improve on the time. The code below takes about 3.5 s (for the export) to write just under 30 MB:
StringDictionary data = new StringDictionary();
Random rand = new Random(123456);
for (int i = 0; i < 1000000; i++)
{
    data.Add("Key " + i, "Value = " + rand.Next());
}

Stopwatch watch = Stopwatch.StartNew();
using (TextWriter output = File.CreateText("foo.txt"))
{
    foreach (DictionaryEntry pair in data)
    {
        output.Write((string)pair.Key);
        output.Write('\t');
        output.WriteLine((string)pair.Value);
    }
}
watch.Stop();
Obviously the performance will depend on the size of the actual data getting written.
