Concat performance - C#

I was reading a blog post on MSDN about iterators which talks about how Concat has O(m^2) performance, where m is the length of the first IEnumerable. One of the comments, by richard_deeming on the second page, provides some sample code which he says is much faster. I don't really understand why it's faster and was hoping someone could explain it to me.
Thanks.

He's simply saying that instead of using Concat to create an iterator which is actually equivalent to creating an iterator over:
...(((a+b)+c)+d)...
which is caused by:
for (int i = 0; i < length; ++i)
    ones = ones.Concat(list);
build a list of the sequences you need and then return the elements of each of those sequences in turn.
This way you don't end up with a deep stack of nested iterators over the first collection of elements.
It's also worth mentioning that the claim about O(m^2) isn't really about Concat itself. It's true in this specific case, but that's like saying + is O(m^2) when you happen to be calculating (((a+b)+c)+d)...; it's the specific usage pattern that makes it O(m^2).

I don't think the blog post is saying that Concat itself is O(m^2) - at least, it shouldn't be; at one point it mentions that Concat is O(m+n), which is much more believable. It's the use of Concat in a loop, as given in that post, that is O(m^2) - and I don't think that's a particularly shocking finding, as you'd expect many calls to multiply up the complexity!
Richard's follow-up suggests deferring the Concat operations until they're needed by storing a list of iterators and then moving through each of them in turn: start with the first and, when that's exhausted, move on to the next. That makes perfect sense - however, for 'light usage' Concat as-is would be fine.
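A minimal sketch of that idea (my own reconstruction, not Richard's exact code): collect the source sequences in a list and yield from each in turn, so no chain of nested Concat iterators is ever built.
using System.Collections.Generic;

static IEnumerable<T> ConcatAll<T>(IEnumerable<IEnumerable<T>> sources)
{
    // A single iterator walks the stored sequences in order; nothing gets nested.
    foreach (var source in sources)
        foreach (var item in source)
            yield return item;
}

// Usage, mirroring the loop above:
// var parts = new List<IEnumerable<int>>();
// for (int i = 0; i < length; ++i)
//     parts.Add(list);
// IEnumerable<int> ones = ConcatAll(parts);
Each element now passes through one iterator instead of up to m of them, which is where the quadratic behaviour came from.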

Related

Convert loop to LINQ

I have a list of EquityAnalytics (stock) objects and I'm doing a calculation for daily returns.
I was thinking there must be a pairwise solution to do this:
for (int i = 0; i < sdata.Count; i++)
{
    sdata[i].DailyReturn = (i > 0) ? (sdata[i - 1].AdjClose / sdata[i].AdjClose) - 1
                                   : 0.0;
}
LINQ stands for: "Language-Integrated Query".
LINQ should not be used, and almost can't be used, for assignments: LINQ doesn't change the given IEnumerable, it creates a new one.
As suggested in a comment below, there is a way to create a new IEnumerable with LINQ, but it will be slower and a lot less readable.
Though LINQ is nice, an important thing is to know when not to use it.
Just use the good old for loop.
I'm new to LINQ (I started using it because of Stack Overflow), so I played with your question. I know this may attract downvotes, but it does what you want as long as there is at least one element in the list:
sdata[0].DailyReturn = 0.0;
sdata.GetRange(1, sdata.Count - 1)
     .ForEach(c => c.DailyReturn = (sdata[sdata.IndexOf(c) - 1].AdjClose / c.AdjClose) - 1);
But I must say that avoiding for loops isn't necessarily best practice. From my point of view, LINQ should be used where convenient, not everywhere. Good old loops are sometimes easier to maintain.
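For what it's worth, the pairwise projection can also be written without IndexOf (which makes the snippet above O(n^2)). A sketch, assuming sdata is a List of objects exposing AdjClose and DailyReturn:
using System.Linq;

// Compute the returns with an indexed Select, then copy them back;
// the assignment itself still needs a plain loop.
var returns = sdata
    .Select((s, i) => i == 0 ? 0.0 : (sdata[i - 1].AdjClose / s.AdjClose) - 1)
    .ToList();
for (int i = 0; i < sdata.Count; i++)
    sdata[i].DailyReturn = returns[i];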

Why can LINQ operations be faster than a normal loop?

A friend and I were a bit perplexed during a programming discussion today. As an example, we created a fictitious problem: given a List<int> of n random integers (typically 1,000,000), write a function that returns the set of all integers that appear more than once. Pretty straightforward stuff. We created one LINQ statement to solve this problem, and a plain insertion-sort-based algorithm.
Now, as we tested the speed the code ran at (using System.Diagnostics.Stopwatch), the results were confusing. Not only did the LINQ code outperform the simple sort, it ran faster than a single foreach/for loop over the list that had no operations inside it (which, on a side note, I thought the compiler was supposed to detect and remove altogether).
If we generated a new List<int> of random numbers in the same execution of the program and ran the LINQ code again, the performance would increase by orders of magnitude (typically thousandfold). The performance of the empty loops was of course the same.
So, what is going on here? Is LINQ using parallelism to outperform normal loops? How are these results even possible? LINQ uses quicksort, which runs in O(n log n), which by definition is already slower than O(n).
And what is happening at the performance leap on the second run?
We were both baffled and intrigued at these results and were hoping for some clarifying insights from the community, just to satisfy our own curiosity.
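(For reference, a duplicate-finding query of the kind described above might look like this; it's a sketch, not the posters' actual code, and numbers stands in for the List<int>.)
using System.Linq;

var duplicates = numbers
    .GroupBy(n => n)            // hash-based grouping, roughly O(n)
    .Where(g => g.Count() > 1)  // keep values that occur more than once
    .Select(g => g.Key)
    .ToList();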
Undoubtedly you haven't actually performed the query; you've merely defined it. LINQ constructs an expression tree that isn't actually evaluated until you perform an operation that requires the enumeration to be iterated. Try adding a ToList() or Count() call to the LINQ query to force it to be evaluated.
Based on your comment I expect this is similar to what you've done. Note: I haven't spent any time figuring out if the query is as efficient as possible; I just want some query to illustrate how the code may be structured.
var dataset = ...
var watch = Stopwatch.StartNew();
var query = dataset.Where(d => dataset.Count(i => i == d) > 1);
watch.Stop(); // timer stops here
foreach (var item in query) // query is actually evaluated here
{
    // ... print out the item ...
}
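If you want the timing to include the actual work, force evaluation inside the measured region, for example with Count() or ToList(). A sketch along the same lines (the query itself is still just illustrative):
var watch = Stopwatch.StartNew();
// Count() enumerates the query, so the real work now happens inside the timer
var count = dataset.Where(d => dataset.Count(i => i == d) > 1).Count();
watch.Stop();
Console.WriteLine("{0} items found in {1}", count, watch.Elapsed);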
I would suggest that LINQ is only faster than a 'normal loop' when your algorithm is less than perfect (or you have some problem in your code). So LINQ will be faster at sorting than you are if you don't write an efficient sorting algorithm, etc.
LINQ is usually 'as fast as' or 'close enough to' the speed of a normal loop, and can be faster (and simpler) to code / debug / read. That's its benefit - not execution speed.
If it's performing faster than an empty loop, you are doing something wrong. Most likely, as suggested in comments, you aren't considering deferred execution and the LINQ statement is not actually executing.
If you did not compile with "Optimize Code" enabled, you would probably see this behaviour. (It would certainly explain why the empty loop was not removed.)
The code underlying LINQ, however, is part of already-compiled code, which will certainly have been optimised (by the JIT, NGen or similar).

Is LINQ Faster, Slower or the Same?

Is this:
Box boxToFind = AllBoxes.FirstOrDefault(box => box.BoxNumber == boxToMatchTo.BoxNumber);
Faster or slower than this:
Box boxToFind = null;
foreach (Box box in AllBoxes)
{
    if (box.BoxNumber == boxToMatchTo.BoxNumber)
    {
        boxToFind = box;
    }
}
Both give me the result I am looking for (boxToFind). This is going to run on a mobile device, so I need to be conscious of performance.
It should be about the same, except that you need to call First (or, to match your code, Last), not Where.
Calling Where will give you a set of matching items (an IEnumerable<Box>); you only want one matching item.
In general, when using LINQ, you need to be aware of deferred execution. In your particular case, it's irrelevant, since you're getting a single item.
The difference is not important unless you've identified this particular loop as a performance bottleneck through profiling.
If profiling does find it to be a problem, then you'll want to look into alternative storage. Storing the data in a dictionary provides faster lookups than looping through an array.
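A sketch of that approach (assuming BoxNumber is an int and unique; ToDictionary throws on duplicate keys):
// Build the lookup once and reuse it for many lookups
Dictionary<int, Box> boxesByNumber = AllBoxes.ToDictionary(b => b.BoxNumber);

// Each lookup is then O(1) on average instead of a scan over AllBoxes
Box boxToFind;
boxesByNumber.TryGetValue(boxToMatchTo.BoxNumber, out boxToFind);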
If micro-optimization is your thing, LINQ performs worse; the article below is just one example, and there are a lot of other posts you can find.
Micro optimization will kill you.
First, finish the whole class, then, if you have performance problems, run a profiler and check for the hotspots of the application.
Make sure you're using the best algorithms you can, then turn to micro optimizations like this.
In case you already did :
Slow -> Fast
LINQ < foreach < for < unsafe for (The last option is not recommended).
Abstractions will make your code slower, 95% of the time.
The fastest is the for loop, but the difference is so small that you can ignore it. It will only matter if you are building a real-time application, and for those applications C# may not be the best choice anyway!
If AllBoxes is an IQueryable, it can be faster than the loop, because the queryable could have an optimized implementation of the Where operation (for example, indexed access).
LINQ is absolutely 100% slower
Depends on what you are trying to accomplish in your program, but for the most part this is most certainly what I would call LAZY PROGRAMMER CODE...
You are going to essentially "stall out" if you are performing any complex queries, joins, etc. It's a total p.o.s for those types of functions/methods; just don't use it. If you do it the hard/long way you will be much happier in the long run... and performance will be a world apart.
NOTE:
I would definitely not recommend LINQ for any program built for speed/synchronization tasks/computation
(i.e. HFT trading &/or AT trading i-0-i for starters).
TESTED:
It took nearly 10 seconds to complete a join in "LINQ" vs. < 1 millisecond.
LINQ vs Loop – A performance test
LINQ: 00:00:04.1052060, avg. 00:00:00.0041052
Loop: 00:00:00.0790965, avg. 00:00:00.0000790
References:
http://ox.no/posts/linq-vs-loop-a-performance-test
http://www.schnieds.com/2009/03/linq-vs-foreach-vs-for-loop-performance.html

Where to draw the line - is it possible to love LINQ too much? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I recently found LINQ and love it. I find lots of occasions where use of it is so much more expressive than the longhand version, but a colleague made a comment about me abusing this technology, which now has me second-guessing myself. It is my perspective that if a technology works efficiently and the code is elegant, then why not use it? Is that wrong? I could spend extra time writing out processes "longhand", and while the resulting code may be a few ms faster, it's 2-3 times more code and therefore 2-3 times more chance that there may be bugs.
Is my view wrong? Should I be writing my code out longhand rather than using LINQ? Isn't this what LINQ was designed for?
Edit: I was speaking about LINQ to objects, I don't use LINQ to XML so much and I have used LINQ to SQL but I'm not so enamoured with those flavours as LINQ to objects.
I have to agree with your view - if it's more efficient to write and more elegant, then what's a few milliseconds? Writing extra code gives more room for bugs to creep in, it's extra code that needs to be tested, and most of all it's extra code to maintain. Think about the person who's going to come in behind you and maintain your code - they'll thank you for writing elegant, easy-to-read code long before they thank you for writing code that's a few ms faster!
Beware though, this cost of a few ms could be significant when you take the bigger picture into account. If that few milliseconds is part of a loop of thousands of repetitions, then the milliseconds add up fast.
Yes you can love LINQ too much - Single Statement LINQ RayTracer
Where do you draw the line? I'd say use LINQ as much as it makes the code simpler and easier to read.
The moment the LINQ version becomes more difficult to understand than the non-LINQ version, it's time to swap, and vice versa. EDIT: This mainly applies to LINQ to Objects, as the other LINQ flavours have their own benefits.
It's not possible to love LINQ to Objects too much, it's a freaking awesome technology!
But seriously, anything that makes your code simple to read, simple to maintain and does the job it was intended for, then you would be silly not to use it as much as you can.
LINQ's supposed to be used to make filtering, sorting, aggregating and manipulating data from various sources as intuitive and expressive as possible. I'd say, use it wherever you feel it's the tidiest, most expressive and most natural syntax for doing what it is you're trying to do, and don't feel guilty about it.
If you start humping the documentation, then it may be time to reconsider your position.
It's cases like these where it's important to remember the golden rules of optimization:
Don't Do It
For Experts: Don't do it yet
You should absolutely not worry about "abusing" LINQ unless you can identify it explicitly as the cause of a performance problem.
Like anything, it can be abused. As long as you stay away from obvious poor decisions such as
var v = List.Where(...);
for(int i = 0; i < v.Count(); i++)
{...}
and understand how deferred execution works, then it is most likely not going to be much slower than the longhand way. According to Anders Hejlsberg (C# architect), the C# compiler is not particularly good at optimizing loops, but it is getting much better at optimizing and parallelizing expression trees. In time, it may be more effective than a loop. List<>'s ForEach is actually as fast as a for loop, although I can't find the link that proves that.
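The snippet above re-runs the Where filter every time v.Count() is evaluated in the loop condition. A sketch of two ways to avoid that (list, IsInteresting and Process are placeholders, not names from the original code):
// Materialise the filtered items once...
var matches = list.Where(x => x.IsInteresting).ToList();
for (int i = 0; i < matches.Count; i++)
{
    Process(matches[i]);
}

// ...or simply let foreach enumerate the deferred query a single time
foreach (var item in list.Where(x => x.IsInteresting))
{
    Process(item);
}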
P.S. My personal favorite is ForEach's lesser-known cousin, IndexedForEach (utilizing extension methods):
List.IndexedForEach((p, i) =>
{
    if (i != 3)
        p.DoSomething(i);
});
LINQ can be like art. Keep using it to make the code beautiful.
You're answering your own question by talking about writing 2-3 times more code for a few ms of performance. I mean, if your problem domain requires that speedup then yes; if not, probably not. However, is it really only a few ms of performance, or is it > 5% or > 10%? This is a value judgement based on the individual case.
Where to draw the line?
Well, we already know that it is a bad idea to implement your own quicksort in linq, at least compared to just using linq's orderby.
I've found that using LINQ has sped up my development and made it easier to avoid the stupid mistakes that loops can introduce. I have had instances where the performance of LINQ was poor, but that was when I was using it to do things like fetch data for an Excel file from a tree structure that had millions of nodes.
While I see how there is a point of view that LINQ might make a statement harder to read, I think it is far outweighed by the fact that my methods are now strictly related to the problems that they are solving and not spending time either including lookup loops or cluttering classes with dedicated lookup functions.
It took a little while to get used to doing things with LINQ, since looping lookups, and the like, have been the main option for so long. I look at LINQ as just being another type of syntactic sugar that can do the same task in a more elegant way. Right now, I am still avoiding it in processing-heavy mission critical code - but that is just until the performance improves as LINQ evolves.
My only concern about LINQ is with its implementation of joins.
As I determined when trying to answer this question (and it's confirmed here), the code LINQ generates to perform joins is (necessarily, I guess) naive: for each item in the list, the join performs a linear search through the joined list to find matches.
Adding a join to a LINQ query essentially turns a linear-time algorithm into a quadratic-time algorithm. Even if you think premature optimization is the root of all evil, the jump from O(n) to O(n^2) should give you pause. (It's O(n^3) if you join through a joined item to another collection, too.)
It's relatively easy to work around this. For instance, this query:
var list = from pr in parentTable.AsEnumerable()
           join cr in childTable.AsEnumerable()
               on pr.Field<int>("ID") equals cr.Field<int>("ParentID")
           where pr.Field<string>("Value") == "foo"
           select cr;
is analogous to how you'd join two tables in SQL Server. But it's terribly inefficient in LINQ: for every parent row that the where clause returns, the query scans the entire child table. (Even if you're joining on an unindexed field, SQL Server will build a hashtable to speed up the join if it can. That's a little outside LINQ's pay grade.)
This query, however:
string fk = "FK_ChildTable_ParentTable";
var list = from cr in childTable.AsEnumerable()
           where cr.GetParentRow(fk).Field<string>("Value") == "foo"
           select cr;
produces the same result, but it scans the child table once only.
If you're using LINQ to objects, the same issues apply: if you want to join two collections of any significant size, you're probably going to need to consider implementing a more efficient method to find the joined object, e.g.:
Dictionary<Foo, Bar> map = buildMap(foos, bars);
var list = from Foo f in foos
           where map[f].baz == "bat"
           select f;

Back to basics; for-loops, arrays/vectors/lists, and optimization

I was working on some code recently and came across a method that had 3 for-loops that worked on 2 different arrays.
Basically, what was happening was a foreach loop would walk through a vector and convert a DateTime from an object, and then another foreach loop would convert a long value from an object. Each of these loops would store the converted value into lists.
The final loop would go through these two lists and store those values into yet another list because one final conversion needed to be done for the date.
Then, after all that is said and done, the final two lists are converted to arrays using ToArray().
Ok, bear with me, I'm finally getting to my question.
So, I decided to make a single for loop to replace the first two foreach loops and convert the values in one fell swoop (the third loop is quasi-necessary, although, I'm sure with some working I could also put it into the single loop).
But then I read the article "What your computer does while you wait" by Gustav Duarte and started thinking about memory management and what the data was doing while it's being accessed in the for-loop where two lists are being accessed simultaneously.
So my question is, what is the best approach for something like this? Condense the for-loops so the work happens in as few loops as possible, at the cost of touching several different lists in each iteration? Or allow the multiple loops and let the system bring in the data it's anticipating? These lists and arrays can potentially be large, and looping through 3 lists, perhaps 4 depending on how ToArray() is implemented, can get very costly (O(n^3)??). But from what I understood in said article and from my CS classes, having to fetch data can be expensive too.
Would anyone like to provide any insight? Or have I completely gone off my rocker and need to relearn what I have unlearned?
Thank you
The best approach? Write the most readable code, work out its complexity, and work out if that's actually a problem.
If each of your loops is O(n), then you've still only got an O(n) operation.
Having said that, it does sound like a LINQ approach would be more readable... and quite possibly more efficient as well. Admittedly we haven't seen the code, but I suspect it's the kind of thing which is ideal for LINQ.
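As a rough sketch of what that might look like (the real code isn't shown, so every name here - vectorA, vectorB and the Convert helpers - is invented):
using System.Linq;

// Each line is a single pass that fuses the per-element conversions.
var yArray = vectorA.Select(a => ConvertDate(a)).ToArray();
var xArray = vectorB.Select(b => FinalTweak(ConvertValue(b))).ToArray();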
For reference, the article is "What your computer does while you wait" by Gustav Duarte. There's also a guide to big-O notation.
It's impossible to answer the question without being able to see code/pseudocode. The only reliable answer is "use a profiler". Assuming what your loops are doing is a disservice to you and anyone who reads this question.
Well, you've got complications if the two vectors are of different sizes. As has already been pointed out, this doesn't increase the overall complexity of the issue, so I'd stick with the simplest code - which is probably 2 loops, rather than 1 loop with complicated test conditions re the two different lengths.
Actually, those extra length tests could easily make a single loop slower than two separate loops. You might also get better memory-fetch performance with 2 loops - i.e. you are looking at contiguous memory: A[0],A[1],A[2]... then B[0],B[1],B[2]..., rather than A[0],B[0],A[1],B[1],A[2],B[2]...
So in every way, I'd go with 2 separate loops ;-p
Am I understanding you correctly in this?
You have these loops:
for (...){
// Do A
}
for (...){
// Do B
}
for (...){
// Do C
}
And you converted it into
for (...){
// Do A
// Do B
}
for (...){
// Do C
}
and you're wondering which is faster?
If not, some pseudocode would be nice, so we could see what you meant. :)
Impossible to say. It could go either way. You're right, fetching data is expensive, but locality is also important. The first version may be better for data locality, but on the other hand, the second has bigger blocks with no branches, allowing more efficient instruction scheduling.
If the extra performance really matters (as Jon Skeet says, it probably doesn't, and you should pick whatever is most readable), you really need to measure both options, to see which is fastest.
My gut feeling says the second, with more work being done between jump instructions, would be more efficient, but it's just a hunch, and it can easily be wrong.
Aside from cache thrashing on large functions, there may be benefits on tiny functions as well. This applies on any auto-vectorizing compiler (not sure if Java JIT will do this yet, but you can count on it eventually).
Suppose this is your code:
// if this compiles down to a raw memory copy with a bitmask...
Date morningOf(Date d) { return Date(d.year, d.month, d.day, 0, 0, 0); }
Date timestamps[N];
Date mornings[N];
// ... then this can be parallelized using SSE or other SIMD instructions
for (int i = 0; i != N; ++i)
mornings[i] = morningOf(timestamps[i]);
// ... and this will just run like normal
for (int i = 0; i != N; ++i)
doOtherCrap(mornings[i]);
For large data sets, splitting the vectorizable code out into a separate loop can be a big win (provided caching doesn't become a problem). If it was all left as a single loop, no vectorization would occur.
This is something that Intel recommends in their C/C++ optimization manual, and it really can make a big difference.
... working on one piece of data but with two functions can sometimes make it so that code to act on that data doesn't fit in the processor's low level caches.
for (int i = 0; i < 10; i++)
{
    myObject obj = array[i];
    obj.functionreallybig1(); // pushes functionreallybig2 out of the cache
    obj.functionreallybig2(); // pushes functionreallybig1 out of the cache
}
vs
for (int i = 0; i < 10; i++)
{
    myObject obj = array[i];
    obj.functionreallybig1(); // this stays in the cache next time through the loop
}
for (int i = 0; i < 10; i++)
{
    myObject obj = array[i];
    obj.functionreallybig2(); // this stays in the cache next time through the loop
}
But it was probably a mistake (usually this type of trick is commented)
When data is cyclically loaded and unloaded like this, it is called cache thrashing, by the way.
This is a separate issue from the data these functions are working on, as the processor typically caches that separately.
I apologize for not responding sooner and for not providing any kind of code. I got sidetracked on my project and had to work on something else.
To answer anyone still monitoring this question;
Yes, like jalf said, the function is something like:
PrepareData(vectorA, vectorB, xArray, yArray):
    listA
    listB
    foreach (value in vectorA)
        convert value, insert into listA
    foreach (value in vectorB)
        convert value, insert into listB
    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]
    xArray = listC.ToArray()
    yArray = listD.ToArray()
I changed it to:
PrepareData(vectorA, vectorB, ref xArray, ref yArray):
    listA
    listB
    for (int i = 0; i < vectorA.count && i < vectorB.count; i++)
        convert value, insert into listA
        convert value, insert into listB
    listC
    listD
    for (int i = 0; i < listB.count; i++)
        listC[i] = listB[i] converted to something
        listD[i] = listA[i]
    xArray = listC.ToArray()
    yArray = listD.ToArray()
Keep in mind that the vectors can potentially have a large number of items. I figured the second one would be better, so that the program wouldn't have to loop over n items 2 or 3 separate times. But then I started to wonder about the effects of memory fetching, or prefetching, or what have you.
So, I hope this helps to clear up the question, although a good number of you have provided excellent answers.
Thank you everyone for the information. Thinking in terms of Big-O and how to optimize has never been my strong point. I believe I am going to put the code back to the way it was; I should have trusted the way it was written before instead of acting on my novice instincts. Also, in the future I will provide more references so everyone can understand what the heck I'm talking about (clarity is also not a strong point of mine :-/).
Thank you again.
