IL optimization for JIT compilers - c#

I am developing a compiler that emits IL code. It is important that the resulting IL is JIT'ted to the fastest possible machine code by the Mono and Microsoft .NET JIT compilers.
My questions are:
Does it make sense to optimize patterns like:
'stloc.0; ldloc.0; ret' => 'ret'
'ldc.i4.0; conv.r8' => 'ldc.r8.0'
and such, or are the JITs smart enough to take care of these?
Is there a specification with the list of optimizations performed by Microsoft/Mono JIT compilers?
Is there any good read with practical recommendations / best practices for optimizing IL so that JIT compilers can in turn generate the most efficient machine code?

The two patterns you described are the easy stuff that the JIT actually gets right (except for non-primitive structs). In SSA form, constant propagation and elimination of dead values are very easy.
No, you have to test what the JIT can do. Look into compiler literature to see what standard optimizations to expect. Then, test for them. The two JITs that we have right now optimize very little and sometimes do not get the most basic stuff right. For example, MyStruct s; s.x = 1; s.x = 1; is not optimized by RyuJIT. s = s; isn't either. s.x + s.x loads x twice from memory. Expect little.
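For concreteness, the struct cases mentioned above look something like this (MyStruct is just an illustrative name):

struct MyStruct { public int x; }

static int Demo(MyStruct s)
{
    s.x = 1;
    s.x = 1;          // redundant store; reportedly not removed for structs
    s = s;            // self-assignment; reportedly not removed either
    return s.x + s.x; // x is loaded from memory twice
}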
You need to understand what machine code basic operations map to. This is not too complicated. Try a few things and look at the disassembly listing. You'll quickly get a feel for what the output is going to look like.

Redundant conversions and loads/stores like that are a pretty inevitable side-effect of a recursive descent parser. You can technically get rid of them with a peephole optimizer. But it is nothing to worry about; the C# and VB.NET compilers generate them as well.
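If you do want to clean them up yourself, a peephole pass over the emitted instruction stream is enough. A minimal sketch, assuming the IL is held as a simple list of mnemonic strings rather than a real emitter API:

using System.Collections.Generic;

static List<string> Peephole(List<string> il)
{
    var result = new List<string>();
    for (int i = 0; i < il.Count; i++)
    {
        // stloc.0; ldloc.0; ret  =>  ret   (safe only because ret ends the method,
        // so the value stored in local 0 can never be read again)
        if (i + 2 < il.Count && il[i] == "stloc.0" && il[i + 1] == "ldloc.0" && il[i + 2] == "ret")
        {
            result.Add("ret");
            i += 2;
            continue;
        }
        // ldc.i4.0; conv.r8  =>  ldc.r8 0.0   (fold the constant conversion)
        if (i + 1 < il.Count && il[i] == "ldc.i4.0" && il[i + 1] == "conv.r8")
        {
            result.Add("ldc.r8 0.0");
            i++;
            continue;
        }
        result.Add(il[i]);
    }
    return result;
}

In a real emitter you would run such rules until no more of them fire, and check local-variable liveness before deleting stores.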
The existing .NET/Mono jitters are very good at optimizing these redundancies away. They focus on optimizing the code that really matters for execution speed: the machine code. This has the very nice advantage that anybody who writes a compiler that generates IL automatically benefits from these optimizations without having to do anything special.
Jitter optimizations are covered in this post.

Related

DivideByZeroException compiler check complexity: easier or harder in MSIL vs C# or no difference?

This is a question related to this fascinating question about detecting divide by zero exceptions at compile time.
From Eric Lippert's answer, this is non-trivial to achieve properly (which I suppose is why it's not provided already).
My question is:
Is the level of difficulty of doing these types of checks the same regardless of the "level" of the language e.g. higher level vs lower level?
Specifically, the C# compiler converts C# to MSIL. Would these types of checks be easier or harder at the MSIL level as part of some kind of second pass check?
Or, does the language itself make very little difference at all?
Reading the gotchas listed in Eric's answer, I would assume the checks would have to be the same in any language? For example, you can have jumps in lots of languages and would therefore need to implement the flow checking Eric describes...?
Just to keep this question specific, would this kind of check be easier or harder in MSIL than it is in C#?
This is a very interesting and deep question -- though perhaps one not well suited to this site.
The question, if I understand it, is what the impact is on the choice of language to analyze when doing static analysis in pursuit of defects; should an analyzer look at IL, or should it look at the source code? Note that I've broadened this question from the original narrow focus on divide-by-zero defects.
The answer is, of course: it depends. Both techniques are commonly used in the static analysis industry, and there are pros and cons of each. It depends on what defects you're looking for, what techniques you are using to prune false paths, suppress false positives and deduce defects, and how you intend to surface discovered defects to developers.
Analyzing bytecode has some clear benefits over source code. The chief one is: if you have a bytecode analyzer for Java bytecode, you can run Scala through it without ever writing a Scala analyzer. If you have an MSIL analyzer, you can run C# or VB or F# through it without writing analyzers for each language.
There are also benefits to analyzing code at the bytecode level. Analyzing control flow is very easy when you have bytecode because you can very quickly organize chunks of bytecode into "basic blocks"; a basic block is a region of code where there is no instruction which branches into its middle, and every normal exit from the block is at its bottom. (Exceptions can of course happen anywhere.) By breaking up bytecode into basic blocks we can compute a graph of blocks that branch to each other, and then summarize each block in terms of its action on local and global state. Bytecode is useful because it is an abstraction over code that shows at a lower level what is really happening.
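To make the basic-block idea concrete, here is a rough sketch of that leader-based splitting (the Instr type and its fields are made up purely for illustration):

using System.Collections.Generic;
using System.Linq;

// Hypothetical instruction representation: a branch knows the index of its target.
record Instr(string OpCode, bool IsBranch = false, int BranchTarget = -1);

static class ControlFlow
{
    static List<List<Instr>> SplitIntoBasicBlocks(IReadOnlyList<Instr> code)
    {
        // A "leader" starts a basic block: the first instruction, every branch
        // target, and every instruction that follows a branch.
        var leaders = new SortedSet<int> { 0 };
        for (int i = 0; i < code.Count; i++)
        {
            if (!code[i].IsBranch) continue;
            if (code[i].BranchTarget >= 0 && code[i].BranchTarget < code.Count)
                leaders.Add(code[i].BranchTarget);
            if (i + 1 < code.Count)
                leaders.Add(i + 1);
        }

        // Each block runs from its leader up to (but not including) the next leader.
        var starts = leaders.ToList();
        var blocks = new List<List<Instr>>();
        for (int b = 0; b < starts.Count; b++)
        {
            int end = b + 1 < starts.Count ? starts[b + 1] : code.Count;
            blocks.Add(code.Skip(starts[b]).Take(end - starts[b]).ToList());
        }
        return blocks;
    }
}

From these blocks you can build the graph of blocks that branch to each other and summarize each block, exactly as described above.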
That abstraction is of course also bytecode's major shortcoming: bytecode loses information about the intentions of the developer. Any defect checker which requires information from source code in order to detect the defect or prevent a false positive is going to give poor results when run on bytecode. Consider for example a C program:
#define DOBAR if(foo)bar();
...
if (blah)
    DOBAR
else
    baz();
If this horrid code were lowered to machine code or bytecode then all we would see is a bunch of branch instructions and we'd have no idea that we ought to be reporting a defect here, that the else binds to the if(foo) and not the if(blah) as the developer intends.
The dangers of the C preprocessor are well known. But there are also great difficulties imposed when doing analysis of complex lowered code at the bytecode level. For example, consider something like C#:
async Task Foo(Something x)
{
    if (x == null) return;
    await x.Bar();
    await x.Blah();
}
Plainly x cannot be null at the points where it is dereferenced here. But C# is going to lower this to some absolutely crazy code; part of that code is going to look something like this:
int state = 0;
Action doit = () => {
    switch (state) {
        case 0:
            if (x == null) {
                state = -1;
                return;
            }
            state = 1;
            goto case 1;
        case 1:
            Task bar = x.Bar();
            state = 2;
            if (<bar is a completed task>) {
                goto case 2;
            } else {
                <assign doit as the completion of bar>
                return;
            }
        case 2:
And so on. (Except that it is much, much more complicated than that.) This will then be lowered into even more abstract bytecode; imagine trying to understand this code at the level of switches being lowered to gotos and delegates lowered into closures.
A static analyzer analyzing the equivalent bytecode would be perfectly within its rights to say "plainly x can be null because we check for it on one branch of the switch; this is evidence that x must be checked for nullity on other branches, and it is not, therefore I will give a null dereference defect on the other branches".
But that would be a false positive. We know something that the static analyzer might not, namely, that the zero state must execute before every other state, and that when the coroutine is resumed x will always have been checked for null already. That's apparent from the original source code but would be very difficult to tease out from the bytecode.
What then do you do, if you wish to get the benefits of bytecode analysis without the drawbacks? There are a variety of techniques; for example, you could write your own intermediate language that was higher level than bytecode -- that has high-level constructs like "yield" or "await", or "for loop" -- write an analyzer that analyzes that intermediate language, and then write compilers that compile each target language -- C#, Java, whatever -- into your intermediate language. That means writing a lot of compilers, but only one analyzer, and maybe writing the analyzer is the hard part.
That was a very brief discussion, I know. It's a complex subject.
If the design of static analyzers on bytecode interests you, consider looking into the design of Infer, an open-source static analyzer for Java and other languages which turns Java bytecode into an even lower-level bytecode suitable for analysis of heap properties; read up on separation logic for inference of heap properties first. https://github.com/facebook/infer

Is it possible to use branch prediction hinting in C#?

For example, I know it is defined for gcc and used in the Linux kernel as:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
If nothing like this is possible in C#, is the best alternative to manually reorder if-statements, putting the most likely case first? Are there any other ways to optimize based on this type of external knowledge?
On a related note, the CLR knows how to identify guard clauses and assumes that the alternate branch will be taken, making this optimization inappropriate to use on guard clauses, correct?
(Note that I realize this may be a micro-optimization; I'm only interested for academic purposes.)
Short answer: No.
Longer answer: You don't really need to in most cases. You can give hints by changing the logic in your statements, for example by testing the most likely condition first (a rough sketch follows). This is easier to do with a performance tool, like the one built into the higher (and more expensive) versions of Visual Studio, since you can capture the mispredicted-branches counter. I realize this is for academic purposes, but it's good to know that the JITter is very good at optimizing your code for you.
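For example (a hypothetical routine; the frequency assumptions are made up):

// Suppose profiling shows that most inputs fall into the first bucket.
// Testing the common case first keeps the hot path as short as possible.
static int Classify(int size)
{
    if (size < 1024)            // by far the most common case (assumption)
        return 0;
    if (size < 1024 * 1024)     // less common
        return 1;
    return 2;                   // rare
}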
As an example of how much the JITter already does for you (taken pretty much verbatim from CLR via C#), this code:
public static void Main() {
    Int32[] a = new Int32[5];
    for (Int32 index = 0; index < a.Length; index++) {
        // Do something with a[index]
    }
}
may seem to be inefficient, since a.Length is a property and as we know in C#, a property is actually a set of one or two methods (get_XXX and set_XXX). However, the JIT knows that it's a property and either stores the length in a local variable for you, or inlines the method, to prevent the overhead.
...some developers have underestimated the abilities of the JIT compiler and have tried to write "clever code" in an attempt to help the JIT compiler. However, any clever attempts that you come up with will almost certainly impact performance negatively and make your code harder to read, reducing its maintainability.
Among other things, it actually goes further and hoists the bounds checking outside of the loop instead of doing it on every iteration, which would degrade performance.
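A rough illustration of that point (exact behavior varies between JIT versions, so treat this as a sketch rather than a guarantee):

static int Sum(int[] a)
{
    int sum = 0;
    // Because the loop bound is a.Length, the JIT can prove every a[i] is in range
    // and omit the per-iteration range check.
    for (int i = 0; i < a.Length; i++)
        sum += a[i];
    return sum;
}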
I realize it has little to do directly with your question, but I guess the point that I'm trying to make is that micro-optimizations like this don't really help you much in C#, because the JIT generally does it better, as it was designed exactly for this. (Fun fact, the x86 JIT compiler performs more aggressive optimizations than the x64 counterpart)
This article explains some of the optimizations that were added in .NET 3.5 SP1, among them being improvements to straightening branches to improve prediction and cache locality.
All of that being said, if you want to read a great book that goes into what the compiler generates and performance of the CLR, I recommend the book that I quoted from above, CLR via C#.
EDIT: I should mention that if this were currently possible in .NET, you could find the information in either the ECMA-335 standard or a working draft. There is no standard that supports this, and viewing the metadata in something like ILDasm or CFF Explorer shows no sign of any special metadata that could hint at branch prediction.

C# running faster than C++?

A friend and I have written an encryption module and we want to port it to multiple languages so that it's not platform-specific. Originally written in C#, I've ported it to C++ and Java. C# and Java will both encrypt at about 40 MB/s, but C++ will only encrypt at about 20 MB/s. Why is C++ running this much slower? Is it because I'm using Visual C++?
What can I do to speed up my code? Is there a different compiler that will optimize C++ better?
I've already tried optimizing the code itself, such as using x >> 3 instead of x / 8 (integer division), or y & 63 instead of y % 64, and other techniques. How can I build the project differently so that it is more performant in C++?
EDIT:
I must admit that I have not looked into how the compiler optimizes code. I have classes that I will be taking here in College that are dedicated to learning about compilers and interpreters.
As for my code in C++, it's not very complicated. There are NO includes, there is "basic" math along with something we call "state jumping" to produce pseudo random results. The most complicated things we do are bitwise operations that actually do the encryption and unchecked multiplication during an initial hashing phase. There are dynamically allocated 2D arrays which stay alive through the lifetime of the Encryption object (and properly released in a destructor). There's only 180 lines in this. Ok, so my micro-optimizations aren't necessary, but I should believe that they aren't the problem, it's about time. To really drill the point in, here is the most complicated line of code in the program:
input[L + offset] ^= state[state[SIndex ^ 255] & 63];
I'm not moving arrays, or working with objects.
Syntactically the entire set of code runs perfect and it'll work seamlessly if I were to encrypt something with C# and decrypt it with C++, or Java, all 3 languages interact as you'd expect they would.
I don't necessarily expect C++ to run faster than C# or Java (which are within 1 MB/s of each other), but I'm sure there's a way to make C++ run just as fast, or at least faster than it is now. I admit I'm not a C++ expert, I'm certainly not as seasoned in it as many of you seem to be, but if I can cut and paste 99% of the code from C# to C++ and get it to work in 5 minutes, then I'm a little put out that it takes twice as long to execute.
RE-EDIT:
I found an optimization in Visual Studio I forgot to set before. Now C++ is running 50% faster than C#. Thanks for all the tips, I've learned a lot about compilers in my research.
Without source code it's difficult to say anything about the performance of your encryption algorithm/program.
I reckon though that you made a "mistake" while porting it to C++, meaning that you used it in an inefficient way (e.g. lots of copying of objects happens). Maybe you also used VC 6, whereas VC 9 would/could produce much better code.
As for the "x >> 3" optimization... modern compilers do convert integer division to bitshifts by themselves. Needless to say that this optimization may not be the bottleneck of your program at all. You should profile it first to find out where you're spending most of your time :)
The question is extremely broad. Something that's efficient in C# may not be efficient in C++, and vice versa.
You're making micro-optimisations, but you need to examine the overall design of your solution to make sure that it makes sense in C++. It may be a good idea to re-design large parts of your solution so that it works better in C++.
As with all things performance related, profile the code first, then modify, then profile again. Repeat until you've got to an acceptable level of performance.
Things that are 'relatively' fast in C# may be extremely slow in C++.
You can write 'faster' code in C++, but you can also write much slower code. Debug builds in particular may be extremely slow in C++. So look at the optimization settings your compiler uses.
Mostly when porting applications, C# programmers tend to use the 'create a million newed objects' approach, which really makes C++ programs slow. You would rewrite these algorithms to use pre-allocated arrays and run tight loops over them.
With pre-allocated memory you leverage the strength of C++ in using pointers to memory, casting them to the right POD-structured data.
But it really depends on what you have written in your code.
So measure your code and see where the implementation burns the most CPU, and then structure your code to use the right algorithms.
Your timing results are definitely not what I'd expect with well-written C++ and well-written C#. You're almost certainly writing inefficient C++. (Either that, or you're not compiling with the same sort of options. Make sure you're testing the release build, and check the optimization options.)
However, micro-optimizations, like you mention, are going to do effectively nothing to improve the performance. You're wasting your time doing things that the compiler will do for you.
Usually you start by looking at the algorithm, but in this case we know the algorithm isn't causing the performance issue. I'd advise using a profiler to see if you can find a big time sink, but it may not find anything different from in C# or Java.
I'd suggest looking at how C++ differs from Java and C#. One big thing is objects. In Java and C#, objects are represented in the same way as C++ pointers to objects, although it isn't obvious from the syntax.
If you're moving objects about in Java and C++, you're moving pointers in Java, which is quick, and objects in C++, which can be slow. Look for where you use medium or large objects. Are you putting them in container classes? Those classes move objects around. Change those to pointers (preferably smart pointers, like std::tr1::shared_ptr<>).
If you're not experienced in C++ (and an experienced and competent C++ programmer would be highly unlikely to be microoptimizing), try to find somebody who is. C++ is not a really simple language, having a lot more legacy baggage than Java or C#, and you could be missing quite a few things.
Free C++ profilers:
What's the best free C++ profiler for Windows?
"Porting" performance-critical code from one language to another is usually a bad idea. You tend not to use the target language (C++ in this case) to its full potential.
Some of the worst C++ code I've seen was ported from Java. There was "new" for almost everything - normal for Java, but a sure performance killer for C++.
You're usually better off not porting, but reimplementing the critical parts.
The main reason C#/Java programs do not translate well (assuming everything else is correct) is that C#/Java developers have not fully grokked the concept of objects and references. Note that in C#/Java all objects are passed by (the equivalent of) a pointer.
class Message
{
    char buffer[10000];
};

Message Encrypt(Message message)  // Here you are making a copy of the message
{
    for (int loop = 0; loop < 10000; ++loop)
    {
        plop(message.buffer[loop]);
    }
    return message;               // Here you are making another copy of the message
}
To re-write this in a (more) C++ style you should probably be using references:
Message& Encrypt(Message& message) // pass a reference to the message
{
...
return message; // return the same reference.
}
The second thing that C#/Java programmers have a hard time with is the lack of garbage collection. If you are not releasing memory correctly, you could start running low on memory and the C++ version will start thrashing. In C++ we generally allocate objects on the stack (i.e. no new). If the lifetime of the object is beyond the current scope of the method/function then we use new, but we always wrap the returned variable in a smart pointer (so that it will be correctly deleted).
void myFunc()
{
    Message m;
    // read the message into m
    Encrypt(m);
}

void alternative()
{
    boost::shared_ptr<Message> m(new Message);
    EncryptUsingPointer(m);
}
Show your code. We can't tell you how to optimize your code if we don't know what it looks like.
You're absolutely wasting your time converting divisions by constants into shift operations. Those kinds of braindead transformations can be made even by the dumbest compiler.
Where you can gain performance is in optimizations that require information the compiler doesn't have. The compiler knows that division by a power of two is equivalent to a right-shift.
Apart from this, there is little reason to expect C++ to be faster. C++ is much more dependent on you writing good code. C# and Java will produce pretty efficient code almost no matter what you do. But in C++, just one or two missteps will cripple performance.
And honestly, if you expected C++ to be faster because it's "native" or "closer to the metal", you're about a decade too late. JIT'ed languages can be very efficient, and with one or two exceptions, there's no reason why they must be slower than a native language.
You might find these posts enlightening.
They show, in short, that yes, ultimately, C++ has the potential to be faster, but for the most part, unless you go to extremes to optimize your code, C# will be just as fast, or faster.
If you want your C++ code to compete with the C# version, then a few suggestions:
Enable optimizations (you've hopefully already done this)
Think carefully about how you do disk I/O (iostreams isn't exactly an ideal library to use)
Profile your code to see what needs optimizing.
Understand your code. Study the assembler output, and see what can be done more efficiently.
Many common operations in C++ are surprisingly slow. Dynamic memory allocation is a prime example. It is almost free in C# or Java, but very costly in C++. Stack-allocation is your friend.
Understand your code's cache behavior. Is your data scattered all over the place? It shouldn't be a surprise then that your code is inefficient.
Totally off topic, but...
I found some info on the encryption module on the homepage you link to from your profile http://www.coreyogburn.com/bigproject.html
(quote)
Put together by my buddy Karl Wessels and I, we believe we have quite a powerful new algorithm.
What separates our encryption from the many existing encryptions is that ours is both fast AND secure. Currently, it takes 5 seconds to encrypt 100 MB. It is estimated that it would take 4.25 * 10^143 years to decrypt it!
[...]
We're also looking into getting a copyright and eventual commercial release.
I don't want to discourage you, but getting encryption right is hard. Very hard.
I'm not saying it's impossible for a twenty-year-old web developer to develop an encryption algorithm that outshines all existing algorithms, but it's extremely unlikely, and I'm very sceptical; I think most people would be.
Nobody who cares about encryption would use an algorithm that's unpublished. I'm not saying you have to open up your source code, but the workings of the algorithm must be public, and scrutinized, if you want to be taken seriously...
There are areas where a language running on a VM outperforms C/C++, for example heap allocation of new objects. You can find more details here.
There is a somewhat old article in Dr. Dobb's Journal named Microbenchmarking C++, C#, and Java where you can see some actual benchmarks, and you will find that C# sometimes is faster than C++. One of the more extreme examples is the single hash map benchmark: .NET 1.1 is a clear winner at 126 and VC++ is far behind at 537.
Some people will not believe you if you claim that a language like C# can be faster than C++, but it actually can. However, using a profiler and the very high level of fine-grained control that C++ offers should enable you to rewrite your application to be very performant.
When serious about performance you might want to be serious about profiling.
Separately, the "string" object implementation used in C# Java and C++, is noticeably slower in C++.
There are some cases where VM-based languages such as C# or Java can be faster than a C++ version, at least if you don't put much work into optimization and have a good knowledge of what is going on in the background. One reason is that the VMs can optimize bytecode at runtime, figure out which parts of the program are used often, and change their optimization strategy. On the other hand, an old-fashioned compiler has to decide how to optimize the program at compile time and may not find the best solution.
The C# JIT probably noticed at run time that the CPU is capable of running some advanced instructions, and is compiling to something better than what the C++ was compiled to.
You can probably (surely with enough effort) outperform this by compiling with the most sophisticated instructions available to the designated CPU and using knowledge of the algorithm to tell the compiler to use SIMD instructions at specific stages.
But before any fancy changes to your code, make sure your C++ is being compiled for your CPU, and not for something much more primitive (Pentium?).
Edit:
If your C++ program does a lot of unwise allocations and deallocations this will also explain it.
In another thread, I pointed out that doing a direct translation from one language to another will almost always end up in the version in the new language running more poorly.
Different languages take different techniques.
Try the Intel compiler. It's much better than VC or gcc. As for the original question, I would be skeptical. Try to avoid using any containers and minimize the memory allocations in the offending function.
[Joke]There is an error in line 13[/Joke]
Now, seriously, no one can answer the question without the source code.
But as a rule of thumb, the fact that the C++ version is that much slower than the managed one most likely points to differences in memory management and object ownership.
For instance, if your algorithm is doing any dynamic memory allocations inside the processing loop, this will affect the performance. If you pass heavy structures by value, this will affect the performance. If you make unnecessary copies of objects, this will affect the performance. Exception abuse will cause performance to go south. And the list goes on.
I know of cases where a forgotten "&" in a parameter declaration resulted in weeks of profiling/debugging:
void DoSomething(const HeavyStructure param); // Heavy structure will be copied
void DoSomething(const HeavyStructure& param); // No copy here
So, check your code to find possible bottlenecks.
C++ is not a language where you must use classes. In my opinion it's not logical to use OOP methodologies where they don't really help. For an encrypter/decrypter it's best not to use classes; use arrays and pointers, and use as few functions/classes/files as possible. The best encryption system consists of a single file containing a few functions. After your function works nicely you can wrap it into classes if you wish. Also check the release build; there is a huge speed difference.
Nothing is faster than good machine/assembly code, so my goal when writing C/C++ is to write my code in such a way that the compiler understands my intentions to generate good machine code. Inlining is my favorite way to do this.
First, here's an aside. Good machine code:
uses registers more often than memory
rarely branches (if/else, for, and while)
uses memory more often than function calls
rarely dynamically allocates any more memory (from the heap) than it already has
If you have a small class with very little code, then implement its methods in the body of the class definition and declare it locally (on the stack) when you use it. If the class is simple enough, then the compiler will often only generate a few instructions to effect its behavior, without any function calls or memory allocation to slow things down, just as if you had written the code all verbose and non-object oriented. I usually have assembly output turned on (/FAs /Fa with Visual C++) so I can check the output.
It's nice to have a language that allows you to write high-level, encapsulated object-oriented code and still translate into simple, pure, lightning fast machine code.
Here's my 2 cents.
I wrote a BlowFish cipher in C (and C#). The C# was almost 'identical' to the C.
How they compared (I can't remember the exact numbers now, so these are just recalled ratios):
C native: 50
C managed: 15
C#: 10
As you can see, the native compilation outperforms any managed version. Why?
I am not 100% sure, but my C version compiled to very optimised assembly code; the assembler output looked almost the same as a hand-written assembly version I found.

Advantages of compilers for functional languages over compilers for imperative languages

As a follow-up to this question, What are the advantages of built-in immutability of F# over C#? -- am I correct in assuming that the F# compiler can make certain optimizations knowing that it's dealing with largely immutable code? I mean, even if a developer writes "functional C#", the compiler wouldn't know all of the immutability that the developer had tried to code in, so it couldn't make the same optimizations, right?
In general would the compiler of a functional language be able to make optimizations that would not be possible with an imperative language--even one written with as much immutability as possible?
Am I correct in assuming that the F# compiler can make certain optimizations knowing that it's dealing with largely immutable code?
Unfortunately not. To a compiler writer, there's a huge difference between "largely immutable" and "immutable". Even guaranteed immutability is not that important to the optimizer; the main thing that it buys you is you can write a very aggressive inliner.
In general would the compiler of a functional language be able to make optimizations that would not be possible with an imperative language--even one written with as much immutability as possible?
Yes, but it's mostly a question of being able to apply the classic optimizations more easily, in more places. For example, immutability makes it much easier to apply common-subexpression elimination because immutability can guarantee you that contents of certain memory cells are not changed.
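A tiny illustration of the rewrite this enables (in C#, purely to show the shape; F is a stand-in for an arbitrary side-effect-free function):

static int F(int x) => x * x;   // stand-in for an arbitrary side-effect-free function

// If the compiler can prove that F has no side effects and that x cannot change,
// it may rewrite Before into After, computing F(x) only once.
static int Before(int x) => F(x) + F(x);
static int After(int x)  { int t = F(x); return t + t; }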
On the other hand, if your functional language is not just immutable but pure (no side effects like I/O), then you enable a new class of optimizations that involve rewriting source-level expressions to more efficient expressions. One of the most important and more interesting to read about is short-cut deforestation, which is a way to avoid allocating memory space for intermediate results. A good example to read about is stream fusion.
If you are compiling a statically typed, functional language for high performance, here are some of the main points of emphasis:
Use memory effectively. When you can, work with "unboxed" values, avoiding allocation and an extra level of indirection to the heap. Stream fusion in particular and other deforestation techniques are all very effective because they eliminate allocations.
Have a super-fast allocator, and amortize heap-exhaustion checks over multiple allocations.
Inline functions effectively. Especially, inline small functions across module boundaries.
Represent first-class functions efficiently, usually through closure conversion. Handle partially applied functions efficiently.
Don't overlook the classic scalar and loop optimizations. They made a huge difference to compilers like TIL and Objective Caml.
If you have a lazy functional language like Haskell or Clean, there are also a lot of specialized things to do with thunks.
Footnotes:
One interesting option you get with total immutability is more ability to execute very fine-grained parallelism. The end of this story has yet to be told.
Writing a good compiler for F# is harder than writing a typical compiler (if there is such a thing) because F# is so heavily constrained: it must do the functional things well, but it must also work effectively within the .NET framework, which was not designed with functional languages in mind. We owe a tip of the hat to Don Syme and his team for doing such a great job on a heavily constrained problem.
No.
The F# compiler makes no attempt to analyze the referential transparency of a method or lambda. The .NET BCL is simply not designed for this.
The F# language specification does reserve the keyword 'pure', so manually marking a method as pure may be possible in vNext, allowing more aggressive graph reduction of lambda-expressions.
However, if you use either record or algebraic data types, F# will create default comparison and equality operators, and provide copy semantics. Amongst many other benefits (pattern-matching, closed-world assumption) this reduces a significant burden!
Yes, if you don't consider F#, but consider Haskell for instance. The fact that there are no side effects really opens up a lot of possibilities for optimization.
For instance, consider this in a C-like language:
int factorial(int n) {
    if (n <= 0) return 1;
    return n * factorial(n - 1);
}

int factorialuser(int m) {
    return factorial(m) * factorial(m);
}
If a corresponding method was written in Haskell, there would be no second call to factorial when you call factorialuser. It might be possible to do this in C#, but I doubt the current compilers do it, even for a simple example like this. As things get more complicated, it would be hard for C# compilers to optimize to the level Haskell can.
Note, F# is not really a "pure" functional language, currently. So, I brought in Haskell (which is great!).
Unfortunately, because F# is only mostly pure, there aren't really that many opportunities for aggressive optimization. In fact, there are some places where F# "pessimizes" code compared to C# (e.g. making defensive copies of structs to prevent observable mutation). On the bright side, the compiler does a good job overall despite this, providing performance comparable to C# in most places while simultaneously making programs easier to reason about.
I would say largely 'no'.
The main 'optimization' advantages you get from immutability or referential transparency are things like the ability to do 'common subexpression elimination' when you see code like ...f(x)...f(x).... But such analysis is hard to do without very precise information, and since F# runs on the .Net runtime and .Net has no way to mark methods as pure (effect-free), it requires a ton of built-in information and analysis to even try to do any of this.
On the other hand, in a language like Haskell (which mostly means 'Haskell', as there are few languages 'like Haskell' that anyone has heard of or uses :)) that is lazy and pure, the analysis is simpler (everything is pure, go nuts).
That said, such 'optimizations' can often interact badly with other useful aspects of the system (performance predictability, debugging, ...).
There are often stories of "a sufficiently smart compiler could do X", but my opinion is that the "sufficiently smart compiler" is, and always will be, a myth. If you want fast code, then write fast code; the compiler is not going to save you. If you want common subexpression elimination, then create a local variable (do it yourself).
This is mostly my opinion, and you're welcome to downvote or disagree (indeed I've heard 'multicore' suggested as a rising reason that potentially 'optimization may get sexy again', which sounds plausible on the face of it). But if you're ever hopeful about any compiler doing any non-trivial optimization (that is not supported by annotations in the source code), then be prepared to wait a long, long time for your hopes to be fulfilled.
Don't get me wrong - immutability is good, and is likely to help you write 'fast' code in many situations. But not because the compiler optimizes it - rather, because the code is easy to write, debug, get correct, parallelize, profile, and decide which are the most important bottlenecks to spend time on (possibly rewriting them mutably). If you want efficient code, use a development process that let you develop, test, and profile quickly.
Additional optimizations for functional languages are sometimes possible, but not necessarily because of immutability. Internally, many compilers will convert code into an SSA (single static assignment) form, where each local variable inside a function can only be assigned once. This can be done for both imperative and functional languages. For instance:
x := x + 1
y := x + 4
can become
x_1 := x_0 + 1
y := x_1 + 4
where x_0 and x_1 are different variable names. This vastly simplifies many transformations, since you can move bits of code around without worrying about what value they have at specific points in the program. This doesn't work for values stored in memory though (i.e., globals, heap values, arrays, etc). Again, this is done for both functional and imperative languages.
One benefit most functional languages provide is a strong type system. This allows the compiler to make assumptions that it wouldn't be able to otherwise. For instance, if you have two references of different types, the compiler knows that they cannot alias (point to the same thing). This is not an assumption a C compiler could ever make.

C# / F# Performance comparison

Is there any C#/F# performance comparison available on web to show proper usage of new F# language?
Natural F# code (e.g. functional/immutable) is slower than natural (imperative/mutable object-oriented) C# code. However, this kind of F# is much shorter than usual C# code.
Obviously, there is a trade-off.
On the other hand, you can, in most cases, achieve performance of F# code equal to performance of C# code. This will usually require coding in an imperative or mutable object-oriented style, profiling, and removing bottlenecks. You use the same tools that you would otherwise use in C#, e.g. .NET Reflector and a profiler.
That said, it pays to be aware of some high-productivity constructs in F# that decrease performance. In my experience I have seen the following cases:
references (vs. class instance variables), only in code executed billions of times
F# comparison (<=) vs. System.Collections.Generic.Comparer, for example in binary search or sort
tail calls -- only in certain cases that cannot be optimized by the compiler or .Net runtime. As noted in the comments, depends on the .Net runtime.
F# sequences are twice as slow as LINQ. This is due to references and the use of functions in the F# library to implement the translation of seq<_>. This is easily fixable, as you might replace the Seq module with one with the same signatures that uses LINQ, PLINQ, or DryadLINQ.
Tuples: the F# tuple is a class allocated on the heap. In some cases, e.g. an int*int tuple, it might pay to use a struct.
Allocations: it's worth remembering that a closure is a class, created with the new operator, which remembers the accessed variables. It might be worth "lifting" the closure out, or replacing it with a function that explicitly takes the accessed variables as arguments (see the sketch after this list).
Try using inline to improve performance, especially for generic code.
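To illustrate the closure point from the list above (shown in C# since the allocation behavior is the same on .NET; the names are made up):

using System;

// The lambda captures 'factor', so the compiler generates a hidden closure class
// and allocates an instance of it on every call to MakeScaler.
static Func<int, int> MakeScaler(int factor) => x => x * factor;

// "Lifted" variant: the dependency is passed explicitly, so no closure object
// needs to be allocated when the caller can simply supply both arguments.
static int Scale(int x, int factor) => x * factor;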
My experience is to code in F# first and optimize only the parts that matter. In certain cases, it might be easier to write the slow functions in C# rather than to try to tweak F#. However, from a programmer-efficiency point of view it makes sense to start/prototype in F#, then profile, disassemble and optimize.
The bottom line is, your F# code might end up slower than C# because of program design decisions, but ultimately efficiency can be obtained.
See these questions that I asked recently:
Is a program F# any more efficient (execution-wise) than C#?
How can I use functional programming in the real world?
Is it possible that F# will be optimized more than other .Net languages in the future?
Here are a few links on (or related to) this topic:
http://cs.hubfs.net/forums/thread/3207.aspx
http://strangelights.com/blog/archive/2007/06/17/1588.aspx
http://khigia.wordpress.com/2008/03/30/ocaml-vs-f-for-big-integer-surprising-performance-test/
http://cs.hubfs.net/blogs/f_team/archive/2006/08/15/506.aspx
http://blogs.msdn.com/jomo_fisher/
What I seem to remember from another post on Robert Pickering's blog (or was it Scott Hanselman's?) is that in the end, because both sit on the same framework, you can get the same performance from both, but you sometimes have to 'twist' the natural expression of the language to do so. In the example I recall, he had to twist F# to get performance comparable with C#...
