I have a C# assembly which processes retail promotions. It can process a promotion that has 1,288 qualifying products in just 7 seconds. However, when it is tasked with processing a promotion that has a larger number of qualifying products, the time taken increases exponentially in relation to the number of products. For example, a promo with 29,962 products takes 7 mins 7 secs and a promo with 77,350 products takes 39 mins and 7 secs.
I've been trying to identify whether there's code in the assembly that can be easily optimized. I set the assembly processing the largest of the promotions, then attached the performance analyzer to the containing process (a BizTalk host instance), which resulted in the following report:
This suggests that the function taking the greatest amount of time is "GetDataPromoLines". This function contains simple string formatting. It is called from the following loop of the function "MapForFF":
foreach (var promoLine in promoLineChunk.PromoLines)
{
outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
promoLineNumber++;
}
The promoLineChunk.PromoLines is a List of a class which describes the promotion; it contains only private strings, one for each column of the database table from which the promotion details were selected. The content of the "GetDataPromoLines" function can be seen below:
private string GetDataPromoLines(VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
StringBuilder sb = new StringBuilder();
string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;
sb.AppendLine(string.Format("PromoDiscount,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\"",
uniqueNumber,
promoLine.CIMS_PROMO_NAME,
promoLine.TYPE,
promoLine.DESCRIPTION_,
promoLine.DISCOUNTLEVEL,
promoLine.COUPONNUMBERMIN,
promoLine.COUPONNUMBERMAX,
promoLine.COUPONNUMBERLENGTH
));
sb.AppendLine(string.Format("ItemReq,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
"00001",
promoLine.IDENTITYTYPE,
promoLine.ITEMNUM,
promoLine.DIVISIONNUM,
promoLine.DEPARTMENTNUM,
promoLine.DEPTGROUPNUM,
promoLine.CLASSNUM,
promoLine.ITEMGROUPNUM,
promoLine.IR_QUANTITY
));
sb.AppendLine(string.Format("TierDefinition,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
"00001",
promoLine.THRESHOLDTYPE,
promoLine.THRESHOLDQTY,
promoLine.THRESHOLDAMT,
promoLine.DISCTYPE,
promoLine.DISCPCT,
promoLine.DISCAMT,
promoLine.DISCAPPLIESTO,
promoLine.DISCQTY,
promoLine.ADDLINFO
));
return sb.ToString();
}
Can anyone suggest what is causing the exponential increase in time to process? Is it something to do with CLR unboxing?
outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
Is that an attempt to build an entire output file by appending strings? There's your Schlemiel the Painter's algorithm.
For cases like this, you really want to use StringBuilder (or even better, output directly into a file stream using StreamWriter or something):
StringBuilder outputFile = new StringBuilder();
foreach (var promoLine in promoLineChunk.PromoLines)
{
outputFile.Append(GetDataPromoLines(promoLine, promoLineNumber+1));
promoLineNumber++;
}
The problem with simple appends is that string is immutable in .NET - every time you modify it, it is copied over. For things like outputting huge text files, this is incredibly costly, of course - you spend most of your time copying the parts of the string that didn't change.
In the same way, don't do sb.AppendLine(string.Format(...)); simply use sb.AppendFormat (followed by sb.AppendLine() for the line break). Ideally, pass the StringBuilder in as an argument, to avoid having to copy the lines over again - although that is a relatively insignificant performance hit next to the outputFile += ....
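For illustration, here is a sketch of what that could look like. The method is renamed AppendDataPromoLines here, and the ItemReq and TierDefinition lines are elided because they follow exactly the same pattern as in the original:

private void AppendDataPromoLines(StringBuilder sb, VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
    string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
    string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;

    // AppendFormat formats straight into the builder's existing buffer; AppendLine adds the newline.
    sb.AppendFormat("PromoDiscount,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\"",
        uniqueNumber,
        promoLine.CIMS_PROMO_NAME,
        promoLine.TYPE,
        promoLine.DESCRIPTION_,
        promoLine.DISCOUNTLEVEL,
        promoLine.COUPONNUMBERMIN,
        promoLine.COUPONNUMBERMAX,
        promoLine.COUPONNUMBERLENGTH);
    sb.AppendLine();

    // ... the ItemReq and TierDefinition lines follow the same AppendFormat/AppendLine pattern ...
}

The caller would then create a single StringBuilder before the loop and call AppendDataPromoLines(outputFile, promoLine, promoLineNumber + 1) inside it.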
As a side-note, be careful when interpreting the results of profiling - it's often subtly misleading. In your case, I'm pretty certain your problem is not in GetDataPromoLines itself (although even that could be improved, as seen above), but in the outputFile += .... It's not enough to just look at the function with the highest exclusive samples. It's also not enough to just look at the hot path, although that's already a huge step-up that usually leads you straight where your attention is needed. Also, understand the difference between sampling and instrumentation - sampling can often lead you to try optimizing a method that's not really a performance problem on its own - rather, it simply shouldn't be called as often as it is. Do not use profiler results as blindfolds - you still need to pay attention to what actually makes sense.
I have the following loops, which iterate for a long time. The queryResult has 397,464 rows and each row has 15 columns, so the number of iterations will be 397,464 * 15 = 5,961,960, plus the outer loop (397,464), for a total of 6,359,424 iterations.
The problem is that this takes a very long time, resulting in page timeouts.
Could this be written in a more efficient way?
var rowHtml = String.Empty;
foreach (DataRow row in queryResult.Rows)
{
rowHtml += "<tr>";
for (int i = 0; i < queryResult.Columns.Count; i++)
{
rowHtml += $"<td>{row[i]}</td>";
}
rowHtml += "</tr>";
}
Building the string: Consider using a StringBuilder. Every time you concatenate strings using the + operator, a new string is created on the heap. This is fine for individual uses, but can be a major slowdown in large workloads like yours. You can specify the StringBuilder's maximum and starting capacities in the constructor, giving you more control over the app's memory usage.
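For instance, the loop above could be reworked roughly like this; the initial capacity is only a guess and should be tuned to your data:

// One builder for the whole table; appends reuse its internal buffer instead of
// allocating a new string per concatenation.
var rowHtml = new StringBuilder(queryResult.Rows.Count * 128);   // rough size estimate
foreach (DataRow row in queryResult.Rows)
{
    rowHtml.Append("<tr>");
    for (int i = 0; i < queryResult.Columns.Count; i++)
    {
        rowHtml.Append("<td>").Append(row[i]).Append("</td>");
    }
    rowHtml.Append("</tr>");
}
string html = rowHtml.ToString();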
Parallelization: I do not know your app's exact context, but I suggest having a look at the System.Threading.Tasks.Parallel class. Its For/ForEach methods allow you to iterate over a collection using a thread pool, which can greatly accelerate processing by spreading it across multiple cores.
Be careful though: If the order of elements is relevant, you should divide the workload into packages instead and build substrings for each of those.
Edit: Correction: String concatenation can only truly be parallelized in some rare cases where the exact length of each substring produced by the loop is fixed and known. In this special case, it is possible to write results directly to a large pre-allocated destination buffer. This is perfectly viable when working with char arrays or pointers, but not advisable with normal C# strings or StringBuilders.
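A minimal sketch of that rare fixed-length case, purely for illustration: rows and FormatFixedWidth are hypothetical, every record must render to exactly recordLength characters for this to be safe, and Parallel.For lives in System.Threading.Tasks.

const int recordLength = 64;                         // assumed fixed width of one rendered record
char[] destination = new char[rows.Count * recordLength];

// Each iteration owns a disjoint slice of the buffer, so no locking is needed.
Parallel.For(0, rows.Count, i =>
{
    string record = FormatFixedWidth(rows[i]);       // hypothetical: always returns recordLength chars
    record.CopyTo(0, destination, i * recordLength, recordLength);
});

string result = new string(destination);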
Asynchronous Processing: It looks like you are writing some kind of web app or server backend. If your content is required on demand, and does not need to be ready the exact moment the page is loaded, consider displaying a loading bar or some notification along the lines of "please wait", while the page waits for your server to send the finished processing results.
Edit: As suggested in comments, there are better ways to solve this issue than constructing the HTML string from a table. Consider using those instead of some elaborate content loading scheme.
So a professor in university just told me that using concatenation on strings in C# (i.e. when you use the plus sign operator) creates memory fragmentation, and that I should use string.Format instead.
Now, I've searched a lot on Stack Overflow and I found a lot of threads about performance, in which concatenating strings wins hands down. (Some of them include this, this and this)
I can't find anyone who talks about memory fragmentation, though. I opened .NET's string.Format using ILSpy, and apparently it uses the same StringBuilder that the string.Concat method does (which, if I understand correctly, is what the + sign is overloaded to). In fact: it uses the code in string.Concat!
I found this article from 2007, but I doubt it's accurate today (or ever!). Apparently the compiler is smart enough to avoid that today, because I can't seem to reproduce the issue. Both adding strings with string.Format and with plus signs end up using the same code internally. As said before, string.Format uses the same code that string.Concat uses.
So now I'm starting to doubt his claim. Is it true?
So a professor in university just told me that using concatenation on strings in C# (i.e. when you use the plus sign operator) creates memory fragmentation, and that I should use string.Format instead.
No, what you should do instead is do user research, set user-focussed real-world performance metrics, and measure the performance of your program against those metrics. When, and only when you find a performance problem, you should use the appropriate profiling tools to determine the cause of the performance issue. If the cause is "memory fragmentation" then address that by identifying the causes of the "fragmentation" and trying experiments to determine what techniques mitigate the effect.
Performance is not achieved by "tips and tricks" like "avoid string concatenation". Performance is achieved by applying engineering discipline to realistic problems.
To address your more specific problem: I have never heard the advice to eschew concatenation in favor of formatting for performance reasons. The advice usually given is to eschew iterated concatenation in favor of builders. Iterated concatenation is quadratic in time and space and creates collection pressure. Builders allocate unnecessary memory but are linear in typical scenarios. Neither creates fragmentation of the managed heap; iterated concatenation tends to produce contiguous blocks of garbage.
The number of times I've had a performance problem that came down to unnecessary fragmentation of a managed heap is exactly one; in an early version of Roslyn we had a pattern where we would allocate a small long lived object, then a small short lived object, then a small long lived object... several hundred thousand times in a row, and the resulting maximally fragmented heap caused user-impacting performance problems on collections; we determined this by careful measurement of the performance in the relevant scenarios, not by ad hoc analysis of the code from our comfortable chairs.
The usual advice is not to avoid fragmentation, but rather to avoid pressure. We found during the design of Roslyn that pressure was far more impactful on GC performance than fragmentation, once our aforementioned allocation pattern problem was fixed.
My advice to you is to either press your professor for an explanation, or to find a professor who has a more disciplined approach to performance metrics.
Now, all that said, you should use formatting instead of concatenation, but not for performance reasons. Rather, for code readability, localizability, and similar stylistic concerns. A format string can be made into a resource, it can be localized, and so on.
Finally, I caution you that if you are putting strings together in order to build something like a SQL query or a block of HTML to be served to a user, then you want to use none of these techniques. These applications of string building have serious security impacts when you get them wrong. Use libraries and tools specifically designed for construction of those objects, rather than rolling your own with strings.
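For example, for SQL that means a parameterized command rather than a concatenated query string. A minimal sketch using System.Data.SqlClient; the table, column, and variable names here are made up:

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "SELECT Id, Name FROM Products WHERE Category = @category", connection))
{
    // The value travels as data, not as SQL text, so it cannot change the query's structure.
    command.Parameters.AddWithValue("@category", userSuppliedCategory);
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            Console.WriteLine(reader.GetString(1));
        }
    }
}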
The problem with string concatenation is that strings are immutable. string1 + string2 does not concatenate string2 onto string1, it creates a whole new string. Using a StringBuilder (or string.Format) does not have this problem. Internally, the StringBuilder holds a char[], which it over-allocates. Appending something to a StringBuilder does not create any new objects unless it runs out of room in the char[] (in which case it over-allocates a new one).
I ran a quick benchmark. I think it proves the point :)
StringBuilder sb = new StringBuilder();
string st;
Stopwatch sw;
sw = Stopwatch.StartNew();
for (int i = 0 ; i < 100000 ; i++)
{
sb.Append("a");
}
st = sb.ToString();
sw.Stop();
Debug.WriteLine($"Elapsed: {sw.Elapsed}");
st = "";
sw = Stopwatch.StartNew();
for (int i = 0 ; i < 100000 ; i++)
{
st = st + "a";
}
sw.Stop();
Debug.WriteLine($"Elapsed: {sw.Elapsed}");
The output:
Elapsed: 00:00:00.0011883 (StringBuilder.Append())
Elapsed: 00:00:01.7791839 (+ operator)
I have a string array of about 20,000,000 values, and I need to convert it to a single string.
I've tried:
string data = "";
foreach (var i in tm)
{
data = data + i;
}
But that takes too long. Does someone know a faster way?
Try StringBuilder:
StringBuilder sb = new StringBuilder();
foreach (var i in tm)
{
sb.Append(i);
}
To get the resulting String use ToString():
string result = sb.ToString();
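If you can estimate the final length up front, pre-sizing the builder saves the intermediate re-allocations while it grows. A small sketch, with tm being the array from the question:

// Sum the lengths first so the builder can allocate its buffer once.
// (Anything approaching int.MaxValue characters won't fit in a single string anyway.)
long estimated = 0;
foreach (var s in tm) estimated += s.Length;
StringBuilder sb = new StringBuilder((int)Math.Min(estimated, int.MaxValue));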
The answer is going to depend on the size of the output string and the amount of memory you have available and usable. The hard limit on string length appears to be 2^31-1 (int.MaxValue) characters, occupying just over 4GB of memory. Whether you can actually allocate that is dependent on your framework version, etc. If you're going to be producing a larger output then you can't put it into a single string anyway.
You've already discovered that naive concatenation is going to be tragically slow. The problem is that every pass through the loop creates a new string, then immediately discards it on the next iteration. This is going to fill up memory pretty quickly, forcing the Garbage Collector to work overtime finding old strings to clear out of memory, not to mention the amount of memory fragmentation and all that stuff that modern programmers don't pay much attention to.
A StringBuilder is a reasonable solution. Internally it allocates blocks of characters that it then stitches together at the end using pointers and memory copies. It saves a lot of hassle that way and is quite speedy.
As for String.Join... it uses a StringBuilder. So does String.Concat, although it is certainly quicker when not inserting separator characters.
For simplicity I would use String.Concat and be done with it.
But then I'm not much for simplicity.
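For reference, the one-liner would be (tm being the string array from the question):

string data = string.Concat(tm);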
Here's an untested and possibly horribly slow answer using LINQ. When I get time I'll test it and see how it performs, but for now:
string result = new String(tm.SelectMany(l => (IEnumerable<char>)l).ToArray());
Obviously there is a potential overflow here since the ToArray call can potentially create an array larger than the String constructor can handle. Try it out and see if it's as quick as String.Concat.
So you can do it in LINQ, like so:
string data = tm.Aggregate("", (current, i) => current + i);
Or you can use the string.Join function
string data = string.Join("", tm);
Can't check it right now, but I'm curious how this option would perform:
var data = String.Join(string.Empty, tm);
Is Join optimized so that it ignores the concatenation with String.Empty?
For data this big, memory-based methods will unfortunately fail, and this will be a real headache for the GC. For this operation, create a file and write every string into it, like this:
using (StreamWriter sw = new StreamWriter("some_file_to_write.txt")){
for (int i=0; i<tm.Length;i++)
sw.Write(tm[i]);
}
Try to avoid using "var" in this performance-demanding approach. Correction: "var" does not affect performance; "dynamic" does.
I am creating a method in C# which generates a text file for a Google Product Feed. The feed will contain upwards of 30,000 records and the text file currently weighs in at ~7MB.
Here's the code I am currently using (some lines removed for brevity's sake).
public static void GenerateTextFile(string filePath) {
var sb = new StringBuilder(1000);
sb.Append("availability").Append("\t");
sb.Append("condition").Append("\t");
sb.Append("description").Append("\t");
// repetitive code hidden for brevity ...
sb.Append(Environment.NewLine);
var items = inventoryRepo.GetItemsForSale();
foreach (var p in items) {
sb.Append("in stock").Append("\t");
sb.Append("used").Append("\t");
sb.Append(p.Description).Append("\t");
// repetitive code hidden for brevity ...
sb.AppendLine();
}
using (StreamWriter outfile = new StreamWriter(filePath)) {
result.Append("Writing text file to disk.").AppendLine();
outfile.Write(sb.ToString());
}
}
I am wondering if StringBuilder is the right tool for the job. Would there be performance gains if I used a TextWriter instead?
I don't know a ton about IO performance so any help or general improvements would be appreciated. Thanks.
File I/O operations are generally well optimized in modern operating systems. You shouldn't try to assemble the entire string for the file in memory ... just write it out piece by piece. The FileStream will take care of buffering and other performance considerations.
You can make this change easily by moving:
using (StreamWriter outfile = new StreamWriter(filePath)) {
to the top of the function, getting rid of the StringBuilder and writing directly to the file instead.
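In other words, something roughly like this, with the StringBuilder gone and each piece written straight to the StreamWriter (the elided columns follow the same pattern as in the question):

public static void GenerateTextFile(string filePath) {
    using (StreamWriter outfile = new StreamWriter(filePath)) {
        // header row
        outfile.Write("availability"); outfile.Write("\t");
        outfile.Write("condition"); outfile.Write("\t");
        outfile.Write("description"); outfile.Write("\t");
        // ... remaining header columns ...
        outfile.WriteLine();

        var items = inventoryRepo.GetItemsForSale();
        foreach (var p in items) {
            outfile.Write("in stock"); outfile.Write("\t");
            outfile.Write("used"); outfile.Write("\t");
            outfile.Write(p.Description); outfile.Write("\t");
            // ... remaining columns ...
            outfile.WriteLine();
        }
    }
}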
There are several reasons why you should avoid building up large strings in memory:
It can actually perform worse, because the StringBuilder has to increase its capacity as you write to it, resulting in reallocation and copying of memory.
It may require more memory than you can physically allocate - which may result in the use of virtual memory (the swap file) which is much slower than RAM.
For truly large files (> 2GB) you will run out of address space (on 32-bit platforms) and will fail to ever complete.
To write the StringBuilder contents to a file you have to use ToString() which effectively doubles the memory consumption of the process since both copies must be in memory for a period of time. This operation may also fail if your address space is sufficiently fragmented, such that a single contiguous block of memory cannot be allocated.
Just move the using statement so it encompasses the whole of your code, and write directly to the file. I see no point in keeping it all in memory first.
Write one string at a time using StreamWriter.Write rather than caching everything in a StringBuilder.
This might be old, but I had a file to write with about 17 million lines, so I ended up batching the writes every 10k lines, similar to this:
for (i6 = 1; i6 <= ball; i6++)
{   // this is the middle of a 6-deep nest ...
    counter++;
    // use the remainder operator (%) to detect every 10,000th line
    divtrue = counter % 10000;
    // build the string of fields, with \n at the end ("whatever" stands for the fields)
    lineout = lineout + whatever;
    // the magic 10k block here
    if (divtrue.Equals(0))
    {
        // append mode, so each 10k batch is added to the end of the file
        using (StreamWriter outFile = new StreamWriter(filepath, true))
        {
            // write the 10k lines with .Write, NOT .WriteLine
            outFile.Write(lineout);
        }
        // reset the string so we don't do something silly like run out of memory
        lineout = "";
    }
}
In my case it was MUCH faster than one line at a time.
I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that will suit my needs, and the custom libraries that people have written for this seem to have some flaws when it comes to gigantic input (a 4GB file with some field values easily having several million characters).
While this seems a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
string line;
while ((line = input.ReadLine()) != null)
{
yield return line.Split('þ');
}
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
var qry = from l in CreateEnumerable(sr).Skip(1)
where l[3].Contains("something")
select new { Field1 = l[0], Field2 = l[1] };
foreach (var item in qry)
{
Console.WriteLine(item.Field1 + " , " + item.Field2);
}
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O Completion Ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don't use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
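In managed code, the usual way to get completion-port-backed file I/O is to open the FileStream for overlapped (asynchronous) access and use the async read methods. A minimal sketch, inside an async method, with the path and chunk size as placeholders:

// FileOptions.Asynchronous opens the handle for overlapped I/O, so reads complete
// via the thread pool's I/O completion port rather than blocking a worker thread.
using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                                   bufferSize: 8192, options: FileOptions.Asynchronous))
{
    var buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        // hand buffer[0..bytesRead] to an incremental parser here
    }
}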
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
You mention that some fields are very, very big; if you try to read them into memory in their entirety, you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
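A rough sketch of that chunked read, here using the built-in System.IO.MemoryMappedFiles support (available since .NET 4) rather than the older wrapper; ProcessChunk is a hypothetical stand-in for an incremental parser that carries any partial field over to the next buffer:

using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var view = mmf.CreateViewStream())
{
    var buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = view.Read(buffer, 0, buffer.Length)) > 0)
    {
        // ProcessChunk (hypothetical) parses what it can and keeps unfinished state for the next chunk.
        ProcessChunk(buffer, bytesRead);
    }
}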
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different from anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))) {
// Read a small field
string smallField = reader.ReadFieldAsText();
// Read a large field
Stream largeField = reader.ReadFieldAsStream();
}
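For illustration, the core loop inside such a reader might look roughly like this. It assumes the layout from the question: fields wrapped in 'þ', one record per line, no escaping of 'þ' inside values, and whatever delimiter sits between the quoted fields is simply skipped. Buffering is left to the underlying StreamReader/FileStream.

public IEnumerable<List<string>> ReadRecords(TextReader reader)
{
    var record = new List<string>();
    var field = new StringBuilder();
    bool insideField = false;

    int c;
    while ((c = reader.Read()) != -1)          // one character at a time
    {
        char ch = (char)c;
        if (ch == 'þ')
        {
            if (insideField)                   // closing quote: the field is complete
            {
                record.Add(field.ToString());
                field.Clear();
            }
            insideField = !insideField;
        }
        else if (insideField)
        {
            field.Append(ch);                  // field content
        }
        else if (ch == '\n' && record.Count > 0)
        {
            yield return record;               // end of record
            record = new List<string>();
        }
        // anything else outside a field (the delimiter, '\r') is ignored
    }

    if (record.Count > 0)
        yield return record;
}

A real EddReader would hand the very large fields back as a Stream, as in the usage example above, rather than buffering them in a StringBuilder.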
While this doesn't help address the large input issue, a possible solution to the parsing issue might be a custom parser that uses the strategy pattern to supply a delimiter.