I have the following loops, which iterate for a long time. queryResult has 397,464 rows and each row has 15 columns, so the inner loop runs 397,464 * 15 = 5,961,960 times, plus 397,464 outer iterations, for 6,359,424 iterations in total.
The problem is that this takes so long that the page times out.
Could this be written in a more efficient way?
var rowHtml = String.Empty;
foreach (DataRow row in queryResult.Rows)
{
    rowHtml += "<tr>";
    for (int i = 0; i < queryResult.Columns.Count; i++)
    {
        rowHtml += $"<td>{row[i]}</td>";
    }
    rowHtml += "</tr>";
}
Building the string: Consider using a StringBuilder. Every time you concatenate strings with the + operator, a new string is allocated on the heap and both operands are copied into it. That is fine for occasional use, but it becomes a major slowdown in large workloads like yours, because each append has to copy everything built so far. You can specify the StringBuilder's starting and maximum capacities in the constructor, giving you more control over the app's memory usage.
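As a minimal sketch of that idea applied to your loop (the initial capacity below is only a rough guess; tune it to your actual output size):

var sb = new StringBuilder(64 * 1024 * 1024); // rough guess at the final size in chars
foreach (DataRow row in queryResult.Rows)
{
    sb.Append("<tr>");
    for (int i = 0; i < queryResult.Columns.Count; i++)
    {
        sb.Append("<td>").Append(row[i]).Append("</td>"); // no intermediate strings per cell
    }
    sb.Append("</tr>");
}
var rowHtml = sb.ToString();

This single change turns the quadratic copying cost of += into a linear append.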
Parallelization: I do not know your app's exact context, but I suggest having a look at the System.Threading.Tasks.Parallel class. Its For/ForEach methods let you iterate over a collection using a thread pool, which can speed up processing considerably by spreading it across multiple cores.
Be careful, though: if the order of elements matters, you should instead divide the workload into packages and build a substring for each package (see the sketch after the correction below).
Edit: Correction: string concatenation can only truly be parallelized in rare cases where the exact length of each substring produced by the loop is fixed and known in advance. In that special case, results can be written directly into a large pre-allocated destination buffer. That is viable when working with char arrays or pointers, but not advisable with ordinary C# strings or StringBuilders.
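To illustrate the "packages" idea (a sketch only, not a drop-in for your exact code): PLINQ can build one substring per row in parallel while AsOrdered preserves the original row order, and the pieces are joined once at the end. Whether this actually pays off depends on how much work each row takes; measure before committing to it.

var rowHtml = string.Concat(
    queryResult.Rows
        .Cast<DataRow>()   // DataRowCollection is non-generic, so Cast<> first
        .AsParallel()
        .AsOrdered()       // keep the original row order
        .Select(row =>
        {
            var sb = new StringBuilder("<tr>");
            foreach (DataColumn col in queryResult.Columns)
            {
                sb.Append("<td>").Append(row[col]).Append("</td>");
            }
            return sb.Append("</tr>").ToString();
        }));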
Asynchronous Processing: It looks like you are writing some kind of web app or server backend. If your content is required on demand, and does not need to be ready the exact moment the page is loaded, consider displaying a loading bar or some notification along the lines of "please wait", while the page waits for your server to send the finished processing results.
Edit: As suggested in comments, there are better ways to solve this issue than constructing the HTML string from a table. Consider using those instead of some elaborate content loading scheme.
I have a large (700MB+) CSV file that I am processing with PLINQ. Here's the query:
var q = from r in ReadRow(src).AsParallel()
        where BoolParser.Parse(r[vacancyIdx])
        select r[apnIdx];
It generates a list of APNs for vacant properties, in case you are wondering.
My question is: how can I extract a stream of "bad records" without doing two passes over the query/stream?
Each line in the CSV file should contain colCount records. I would like to enforce this by changing the where clause to where r.Count == colCount && BoolParser.Parse(r[vacancyIdx]).
But then any malformed input will silently disappear.
I need to capture any malformed lines in an error log and flag that n lines of input were not processed.
Currently I do this work in the ReadRow() function, but it seems like there ought to be a PLINQ-ish way to split a stream of data into two or more streams to be processed.
Anyone out there know how to do this? If not, does anyone know how to get this suggestion added to the PLINQ new feature requests? ;-)
What you're asking for doesn't make much sense, because PLINQ is based on a "pull" model (i.e. the consumer decides when to consume an item). Consider code like (using C# 7 tuple syntax for brevity):
var (good, bad) = ReadRow(src).AsParallel().Split(r => r.Count == colCount);
foreach (var item in bad)
{
    // do something
}
foreach (var item in good)
{
    // do something else
}
The implementation of Split has two options:
1. Block one stream when the current item belongs to the other stream. In the above example, this would deadlock as soon as the first good item appeared.
2. Cache the values of one stream while the other stream is being read. In the above example, assuming the vast majority of items are good, roughly 700 MB of your data would be kept in memory between the two foreach loops. That is also undesirable.
So, I think your solution of doing it in ReadRow is okay.
Another option would be something like:
where CheckCount(r) && BoolParser.Parse(r[vacancyIdx])
Here, the CheckCount method reports any errors it finds and returns false for them. (If you do this, make sure to make the reporting thread-safe.)
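A rough sketch of what such a CheckCount could look like, assuming ReadRow yields an IList<string> per line; the names badRows and badRowCount are invented for illustration, and colCount is the expected column count from your question:

// Thread-safe collection of malformed lines, so the where clause can reject them
// without losing them. ConcurrentBag and Interlocked keep this safe under PLINQ.
private static readonly ConcurrentBag<IList<string>> badRows = new ConcurrentBag<IList<string>>();
private static int badRowCount;

private static bool CheckCount(IList<string> r)
{
    if (r.Count == colCount)
        return true;

    badRows.Add(r);                          // keep the malformed line for the error log
    Interlocked.Increment(ref badRowCount);  // tally of skipped lines
    return false;
}

After the query runs, badRowCount tells you how many lines were skipped and badRows holds them for logging.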
If you still want to propose adding something like this to PLINQ, or just discuss the options, you can create an issue in the corefx repository.
I have an array filled with protein IDs, and I have a dictionary mapping protein IDs to protein names. I want to print every protein whose ID is in the array. This is the method I am using to do that, but it is not fast enough for a large number of proteins.
string txt = "****** ID : {0} , Protein Name : {1} ******";
for (int i = 0; i < codonarray.Length; i++)
{
    if (codonarray[i] != null)
    {
        if (dictionaryproteins.TryGetValue(codonarray[i], out myvalue))
        {
            Console.WriteLine(txt, codonarray[i], myvalue.name);
        }
    }
}
The dictionary lookup you are using has O(1) performance (see: Big O notation). This means that the time required for one lookup is nearly constant and does not depend on the size of the dictionary; it does not matter whether you have 10 or 10,000,000 items in it.
You want to print each element of the array, so you need to loop through the array anyway. You could hide the loop in some LINQ construct, but the loop is still there and cannot be optimized away. This is an O(n) operation.
This means that you are already doing the matching in an optimal way. The combined performance (looping + lookup) is O(n). You cannot do better.
If you are doing I/O, that is most probably the source of your performance problem. Try to minimize it; use buffering, caching, etc.
Print only a summary, or only every n-th line. If printing alone takes too much time, how much time will it take to read the output?
The console is relatively slow to print to. Write the data to a file instead of the console.
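For example, something along these lines (the file name is just a placeholder, and System.IO is assumed); StreamWriter buffers its output, so this is much faster than printing each line to the console:

using (var writer = new StreamWriter("proteins.txt")) // placeholder path
{
    for (int i = 0; i < codonarray.Length; i++)
    {
        if (codonarray[i] != null && dictionaryproteins.TryGetValue(codonarray[i], out myvalue))
        {
            writer.WriteLine(txt, codonarray[i], myvalue.name); // same format string as before
        }
    }
}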
string[] fileContent = File.ReadAllLines(@"Phrases\Phrases.txt");
Then work with fileContent, which in your case would hold the protein information.
As an example, I use a file of 8,980 lines like this for speech recognition, printing to a text box, and it prints really fast; you can change it as you see fit:
string[] fileContent0 = File.ReadAllLines(@"Phrases\Phrases.txt");
_recognizer.LoadGrammarAsync(new Grammar(new GrammarBuilder(new Choices(fileContent0))));
I have a C# assembly which processes retail promotions. It is able to process a promotion that has 1,288 qualifying products in just 7 seconds. However, where it is tasked to process a promotion with a larger number of qualifying products then the time taken increases exponentially in relation to the number of products. For example, a promo with 29,962 products takes 7 mins 7 secs and a promo with 77,350 products takes 39 mins and 7 secs.
I've been trying to identify whether there is code in the assembly that can be easily optimized. I set the assembly processing the largest of the promotions, then attached the performance analyzer to the containing process (a BizTalk host instance), which produced the following report:
This suggests that the function taking the greatest amount of time is "GetDataPromoLines". This function contains simple string formatting. It is called from the following loop of the function "MapForFF":
foreach (var promoLine in promoLineChunk.PromoLines)
{
    outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
    promoLineNumber++;
}
The promoLineChunk.PromoLines is a List of a class that describes the promotion; it contains only private strings, one for each column of the database table from which the promotion details were selected. The content of the GetDataPromoLines function can be seen below:
private string GetDataPromoLines(VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
    StringBuilder sb = new StringBuilder();

    string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
    string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;

    sb.AppendLine(string.Format("PromoDiscount,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\"",
        uniqueNumber,
        promoLine.CIMS_PROMO_NAME,
        promoLine.TYPE,
        promoLine.DESCRIPTION_,
        promoLine.DISCOUNTLEVEL,
        promoLine.COUPONNUMBERMIN,
        promoLine.COUPONNUMBERMAX,
        promoLine.COUPONNUMBERLENGTH
        ));
    sb.AppendLine(string.Format("ItemReq,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
        "00001",
        promoLine.IDENTITYTYPE,
        promoLine.ITEMNUM,
        promoLine.DIVISIONNUM,
        promoLine.DEPARTMENTNUM,
        promoLine.DEPTGROUPNUM,
        promoLine.CLASSNUM,
        promoLine.ITEMGROUPNUM,
        promoLine.IR_QUANTITY
        ));
    sb.AppendLine(string.Format("TierDefinition,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\",\"{8}\"",
        "00001",
        promoLine.THRESHOLDTYPE,
        promoLine.THRESHOLDQTY,
        promoLine.THRESHOLDAMT,
        promoLine.DISCTYPE,
        promoLine.DISCPCT,
        promoLine.DISCAMT,
        promoLine.DISCAPPLIESTO,
        promoLine.DISCQTY,
        promoLine.ADDLINFO
        ));

    return sb.ToString();
}
Can anyone suggest what is causing the exponential increase in time to process? Is it something to do with CLR unboxing?
outputFile = outputFile + GetDataPromoLines(promoLine, promoLineNumber+1);
Is that an attempt to build an entire output file by appending strings? There's your Schlemiel the Painter's algorithm.
For cases like this, you really want to use a StringBuilder (or better yet, write directly into a file stream using a StreamWriter or similar):
var outputFile = new StringBuilder();

foreach (var promoLine in promoLineChunk.PromoLines)
{
    outputFile.Append(GetDataPromoLines(promoLine, promoLineNumber+1));
    promoLineNumber++;
}
The problem with simple appends is that string is immutable in .NET - every time you modify it, it is copied over. For things like outputting huge text files, this is incredibly costly, of course - you spend most of your time copying the parts of the string that didn't change.
In the same way, don't do sb.AppendLine(string.Format(...)); simply use sb.AppendFormat. Ideally, pass the StringBuilder in as an argument to avoid copying the lines themselves, although that is a relatively insignificant performance hit next to the outputFile += ....
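For example, GetDataPromoLines could append straight into the shared builder instead of returning a string (a sketch only; the method name AppendDataPromoLines is made up, and only the first of the three record lines is shown):

private void AppendDataPromoLines(StringBuilder sb, VW_BT_PROMOTIONSRECORDSELECT promoLine, int sequenceNumber)
{
    string seqNum = sequenceNumber.ToString().PadLeft(5, '0');
    string uniqueNumber = promoLine.CIMS_PROMO_NUMBER + seqNum;

    // AppendFormat writes into the existing buffer; no intermediate string is created
    sb.AppendFormat("PromoDiscount,\"{0}\",\"{1}\",\"{2}\",\"{3}\",\"{4}\",\"{5}\",\"{6}\",\"{7}\"",
        uniqueNumber,
        promoLine.CIMS_PROMO_NAME,
        promoLine.TYPE,
        promoLine.DESCRIPTION_,
        promoLine.DISCOUNTLEVEL,
        promoLine.COUPONNUMBERMIN,
        promoLine.COUPONNUMBERMAX,
        promoLine.COUPONNUMBERLENGTH)
      .AppendLine();

    // ... append the ItemReq and TierDefinition lines the same way ...
}

The loop above would then call AppendDataPromoLines(outputFile, promoLine, promoLineNumber + 1) instead of concatenating return values.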
As a side-note, be careful when interpreting the results of profiling - it's often subtly misleading. In your case, I'm pretty certain your problem is not in GetDataPromoLines itself (although even that could be improved, as seen above), but in the outputFile += .... It's not enough to just look at the function with the highest exclusive samples. It's also not enough to just look at the hot path, although that's already a huge step-up that usually leads you straight where your attention is needed. Also, understand the difference between sampling and instrumentation - sampling can often lead you to try optimizing a method that's not really a performance problem on its own - rather, it simply shouldn't be called as often as it is. Do not use profiler results as blindfolds - you still need to pay attention to what actually makes sense.
EDIT:
@Everyone Sorry, I feel silly getting mixed up over the size of Int32. The question could be closed, but since there are several answers already, I selected the first one.
Original question is below for reference
I am looking for a way to load a specific line from very large textfiles and I was planning on using File.ReadLines and the Skip() method:
File.ReadLines(fileName).Skip(nbLines).Take(1).ToArray();
Problem is, Skip() takes an int value, and int values are limited to 2 million or so. Should be fine for most files, but what if the file contains, say 20 million lines? I tried using a long, but no overload of Skip() accepts longs.
Lines are of variable, unknown length so I can't count the bytes.
Is there an option that doesn't involve reading line by line or splitting the file in chunks? This operation must be very fast.
Integers are 32-bit numbers, and so are limited to 2 billion or so.
That said, if you have to read a random line from the file, and all you know about the file is that it consists of lines of unknown length, you will have to read it line by line until you reach the line you want. You can use buffering to ease up on the I/O a little (it is on by default), but you won't get better performance than that.
Unless you change the way the file is saved. If you could create an index file containing the position of each line in the main file, you could make reading a line infinitely faster.
Well, not infinitely, but a lot faster: from O(N) to almost O(1) (almost, because seeking to a random byte in a file may not be an O(1) operation, depending on how the OS handles it).
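A sketch of that index idea (the method names and file layout are invented here): scan the file once to record the byte offset where each line starts, then seek straight to the requested line afterwards. The index could be persisted next to the data file so the scan only ever happens once.

// One-time pass: record the byte offset of the start of every line.
static long[] BuildLineIndex(string dataFile)
{
    var offsets = new List<long> { 0 };
    using (var fs = File.OpenRead(dataFile))
    {
        int b;
        while ((b = fs.ReadByte()) != -1)
        {
            if (b == '\n')
                offsets.Add(fs.Position); // next line starts right after the '\n'
        }
    }
    return offsets.ToArray();
}

// Later: seek directly to the line you want instead of skipping past all the others.
static string ReadLineAt(string dataFile, long[] index, long lineNumber)
{
    using (var fs = File.OpenRead(dataFile))
    {
        fs.Seek(index[lineNumber], SeekOrigin.Begin);
        using (var reader = new StreamReader(fs))
            return reader.ReadLine();
    }
}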
I voted to close your question because your premises are incorrect. However, were this a real problem, there's nothing to stop you writing your own Skip extension method that takes a long instead of an int:
public static class SkipEx
{
    public static IEnumerable<T> LongSkip<T>(this IEnumerable<T> src, long numToSkip)
    {
        long counter = 0L;
        foreach (var item in src)
        {
            if (counter++ < numToSkip)
                continue;
            yield return item;
        }
    }
}
so now you can do such craziness as
File.ReadLines(filename).LongSkip(100000000000L)
without problems (and come back next year...). Tada!
Int values are limited to around 2 billion, not 2 million. So unless your file is going to have more than about 2.1 billion lines, you should be fine.
You can always use SkipWhile and TakeWhile and write your own predicates.
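For instance, a sketch using SkipWhile with a captured long counter (the indexed SkipWhile overload is also limited to int, so the counter lives outside the lambda; the target value is just a made-up example):

long target = 20000000000L; // hypothetical line number beyond int.MaxValue
long seen = 0;
string line = File.ReadLines(fileName)
                  .SkipWhile(_ => seen++ < target)
                  .FirstOrDefault();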
I was working on some code recently and came across a method that had 3 for-loops that worked on 2 different arrays.
Basically, what was happening was that one foreach loop would walk through a vector and convert a DateTime from an object, and then another foreach loop would convert a long value from an object. Each of these loops stored the converted values into lists.
The final loop would go through these two lists and store those values into yet another list because one final conversion needed to be done for the date.
Then, after all that is said and done, the final two lists are converted to arrays using ToArray().
Ok, bear with me, I'm finally getting to my question.
So, I decided to make a single for loop to replace the first two foreach loops and convert the values in one fell swoop (the third loop is quasi-necessary, although, I'm sure with some working I could also put it into the single loop).
But then I read the article "What your computer does while you wait" by Gustav Duarte and started thinking about memory management and what the data was doing while it's being accessed in the for-loop where two lists are being accessed simultaneously.
So my question is: what is the best approach for something like this? Try to condense the for-loops so the work happens in as few loops as possible, at the cost of touching several lists in each iteration? Or allow the multiple loops and let the system bring in the data it anticipates? These lists and arrays can be potentially large, and looping through 3 lists, perhaps 4 depending on how ToArray() is implemented, can get very costly (O(n^3)??). But from what I understood in said article and from my CS classes, having to fetch data can be expensive too.
Would anyone like to provide any insight? Or have I completely gone off my rocker and need to relearn what I have unlearned?
Thank you
The best approach? Write the most readable code, work out its complexity, and work out if that's actually a problem.
If each of your loops is O(n), then you've still only got an O(n) operation.
Having said that, it does sound like a LINQ approach would be more readable... and quite possibly more efficient as well. Admittedly we haven't seen the code, but I suspect it's the kind of thing which is ideal for LINQ.
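Without seeing the real code, a rough guess at what that LINQ version could look like; every method name below (ConvertToDate, ConvertToLong, FinalDateConversion) is an invented stand-in for the conversions described in the question.

// Hypothetical sketch: each pipeline converts as it goes, so no intermediate lists are needed.
var xArray = vectorA.Select(item => FinalDateConversion(ConvertToDate(item))).ToArray();
var yArray = vectorB.Select(item => ConvertToLong(item)).ToArray();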
For reference, the article is "What Your Computer Does While You Wait" by Gustav Duarte. There is also a guide to Big-O notation.
It's impossible to answer the question without being able to see code/pseudocode. The only reliable answer is "use a profiler". Assuming what your loops are doing is a disservice to you and anyone who reads this question.
Well, you've got complications if the two vectors are of different sizes. As has already been pointed out, this doesn't increase the overall complexity of the issue, so I'd stick with the simplest code - which is probably 2 loops, rather than 1 loop with complicated test conditions for the two different lengths.
Actually, those length tests alone could easily make the two loops quicker than a single loop. You might also get better memory fetch performance with 2 loops - i.e. you are looking at contiguous memory: A[0], A[1], A[2], ... then B[0], B[1], B[2], ..., rather than A[0], B[0], A[1], B[1], A[2], B[2], ...
So in every way, I'd go with 2 separate loops ;-p
Am I understanding you correctly in this?
You have these loops:
for (...) {
    // Do A
}
for (...) {
    // Do B
}
for (...) {
    // Do C
}
And you converted it into
for (...) {
    // Do A
    // Do B
}
for (...) {
    // Do C
}
and you're wondering which is faster?
If not, some pseudocode would be nice, so we could see what you meant. :)
Impossible to say. It could go either way. You're right, fetching data is expensive, but locality is also important. The first version may be better for data locality, but on the other hand, the second has bigger blocks with no branches, allowing more efficient instruction scheduling.
If the extra performance really matters (as Jon Skeet says, it probably doesn't, and you should pick whatever is most readable), you really need to measure both options, to see which is fastest.
My gut feeling says the second, with more work being done between jump instructions, would be more efficient, but it's just a hunch, and it can easily be wrong.
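A minimal way to measure would be something along these lines; RunCombinedLoop and RunSeparateLoops are hypothetical stand-ins for the two variants being compared, and you should run each several times and ignore the first run (JIT warm-up).

// Time both variants with System.Diagnostics.Stopwatch and compare.
var sw = Stopwatch.StartNew();
RunCombinedLoop(vectorA, vectorB);
sw.Stop();
Console.WriteLine($"Combined loop:  {sw.ElapsedMilliseconds} ms");

sw.Restart();
RunSeparateLoops(vectorA, vectorB);
sw.Stop();
Console.WriteLine($"Separate loops: {sw.ElapsedMilliseconds} ms");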
Aside from cache thrashing on large functions, there may be benefits for tiny functions as well. This applies to any auto-vectorizing compiler (I'm not sure whether the Java JIT will do this yet, but you can count on it eventually).
Suppose this is your code:
// if this compiles down to a raw memory copy with a bitmask...
Date morningOf(Date d) { return Date(d.year, d.month, d.day, 0, 0, 0); }

Date timestamps[N];
Date mornings[N];

// ... then this can be parallelized using SSE or other SIMD instructions
for (int i = 0; i != N; ++i)
    mornings[i] = morningOf(timestamps[i]);

// ... and this will just run like normal
for (int i = 0; i != N; ++i)
    doOtherCrap(mornings[i]);
For large data sets, splitting the vectorizable code out into a separate loop can be a big win (provided caching doesn't become a problem). If it was all left as a single loop, no vectorization would occur.
This is something that Intel recommends in their C/C++ optimization manual, and it really can make a big difference.
Working on one piece of data but with two functions can sometimes mean that the code acting on that data doesn't fit in the processor's low-level instruction caches.
for (i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig1(); // pushes functionreallybig2 out of the cache
    obj.functionreallybig2(); // pushes functionreallybig1 out of the cache
}
vs
for (i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig1(); // this stays in the cache next time through the loop
}
for (i = 0; i < 10; i++) {
    myObject obj = array[i];
    obj.functionreallybig2(); // this stays in the cache next time through the loop
}
But it was probably a mistake (usually this type of trick is commented)
When data is cyclically loaded and evicted like this, it is called cache thrashing, by the way.
This is a separate issue from the data these functions are working on, as the processor typically caches that separately.
I apologize for not responding sooner and for not providing any kind of code. I got sidetracked on my project and had to work on something else.
To answer anyone still following this question:
Yes, as jalf said, the function is something like:
PrepareData(vectorA, vectorB, xArray, yArray):
    listA
    listB
    foreach (value in vectorA)
        convert value, insert into listA
    foreach (value in vectorB)
        convert value, insert into listB
    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]
    xArray = listC.ToArray()
    yArray = listD.ToArray()
I changed it to:
PrepareData(vectorA, vectorB, ref xArray, ref yArray):
    listA
    listB
    for (int i = 0; i < vectorA.count && i < vectorB.count; i++)
        convert value from vectorA, insert into listA
        convert value from vectorB, insert into listB
    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]
    xArray = listC.ToArray()
    yArray = listD.ToArray()
Keeping in mind that the vectors can potentially hold a large number of items, I figured the second version would be better, so the program wouldn't have to loop over n items two or three separate times. But then I started to wonder about the effects of memory fetching, or prefetching, or what have you.
So, I hope this helps to clear up the question, although a good number of you have provided excellent answers.
Thank you everyone for the information. Thinking in terms of Big-O and how to optimize has never been my strong point. I believe I am going to put the code back the way it was; I should have trusted the way it was written before instead of jumping on my novice instincts. Also, in the future I will include more references so everyone can understand what the heck I'm talking about (clarity is also not a strong point of mine :-/).
Thank you again.