Alright, so I wanted to ask if it's actually possible to make a parser from C# to C++.
So that code written in C# would be able to run as fast as code written in C++.
Is it actually possible to do? I'm not asking how hard it's going to be.
What makes you think that translating your C# code to C++ would magically make it faster?
Languages don't have a speed. Assuming that C# code is slower (I'll get back to that), it is because of what that code does (including the implicit requirements placed by C#, such as bounds checking on arrays), and not because of the language it is written in.
If you converted your C# code to C++, it would still need to do bounds checking on arrays, because the original source code expected this to happen, so it would have to do just as much work.
Moreover, C# often isn't slower than C++. There are plenty of benchmarks floating around on the internet, generally showing that for the most part, C# is as fast as (or faster than) C++. Only when you spend a lot of time optimizing your code does C++ become faster.
If you want faster code, you need to write code that requires less work to execute, not try to change the source language. That's just cargo-cult programming at its worst. You once saw some efficient code, and that was written in C++, so now you try to make things C++, in the hope of attracting any efficiency that might be passing by.
It just doesn't work that way.
Although you could translate C# code to C++, there would be the issue that C# depends on the .NET Framework libraries, which are not native, so a straight source-to-source translation would not be enough.
Update
Also C# code depends on the runtime to do things such as memory management i.e. Garbage Collection. If you translated the C# code to C++, where would the memory management code be? Parsing and translating is not going to fix issues like that.
The Mono project has invested quite a lot of energy in turning LLVM into a native machine code compiler for the C# runtime, although there are some problems with specific language constructs like shared generics etc. Check it out and take it for a spin.
You can use NGen to compile IL to native code.
Performance related tweaks:
Platform independent
use a profiler to spot the bottlenecks;
prevent unnecessary garbage (spot it using generation #0 collect count and the Large Object heap)
prevent unnecessary copying (use struct wisely)
prevent unwarranted generics (code-sharing has unexpected performance side effects)
prefer old-fashioned loops over enumerator blocks when performance is an issue
when using LINQ, watch closely where you maintain/break deferred evaluation; either can be an enormous boost to performance
use Reflection.Emit/Expression Trees to precompile dynamic logic that is a performance bottleneck (see the sketch just below this list)
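To make the Expression Trees point concrete, here is a minimal sketch (the x * x + 1 formula and all the names are placeholders): build the tree once, call Compile() outside the hot loop, and reuse the resulting delegate instead of re-evaluating the dynamic logic on every call.

    using System;
    using System.Linq.Expressions;

    class PrecompileDemo
    {
        static void Main()
        {
            // Build the expression tree for x => x * x + 1 once...
            ParameterExpression x = Expression.Parameter(typeof(double), "x");
            var tree = Expression.Lambda<Func<double, double>>(
                Expression.Add(Expression.Multiply(x, x), Expression.Constant(1.0)),
                x);

            // ...compile it to a real delegate once, outside the hot loop...
            Func<double, double> compiled = tree.Compile();

            // ...and reuse the compiled delegate in the hot path.
            double sum = 0;
            for (int i = 0; i < 1000000; i++)
                sum += compiled(i);

            Console.WriteLine(sum);
        }
    }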
Mono
use Mono --gc=sgen --optimize=inline,... (the SGEN garbage collector can make orders of magnitude difference). See also man mono for a lot of tuning/optimization options
use MONO_GENERIC_SHARING=none to disable sharing of generics (making particular tasks a lot quicker especially when supporting both valuetypes and reftypes) (not recommended for regular production use)
use the -optimize+ compile flag (optimizing the CLR code independently from what the JITter may do with that)
Less mainstream:
use the LLVM backend:
This allows Mono to benefit from all of the compiler optimizations done in LLVM. For example the SciMark score goes from 482 to 610.
use mkbundle to create a statically linked NATIVE binary image (already fully JITted, i.e. AOT (ahead-of-time compiled))
MS .NET
Most of the above have direct Microsoft equivalents (NGen, /optimize+, etc.)
Of course MS doesn't have a switchable/tunable garbage collector, and I don't think a fully compiled native binary can be achieved the way it can with Mono.
As always the answer to making code run faster is:
Find the bottleneck and optimize that
Most of the time the bottleneck is either:
time spent in a critical loop
Review your algorithm and data structures; do not change the language. The latter will give a 10% speedup, the former will give you a 1000x speedup.
If you're stuck on the best algorithm, you can always ask a specific, short and detailed question on SO.
time spent waiting for resources from a slow source
Reduce the amount of stuff you're requesting from the source
instead of:
SELECT * FROM bigtable
do
SELECT TOP 10 * FROM bigtable ORDER BY xxx
The latter will return instantly and you cannot show a million records in a meaningful way anyhow.
Or you can have the server at the other end reduce the data so that it doesn't take 100 years to cross the network.
Alternatively you can execute the slow data-fetch routine in a separate thread, so the rest of your program can do meaningful stuff instead of waiting.
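As a minimal sketch of that last point (FetchRecordsAsync and its delay are stand-ins for your real slow query):

    using System;
    using System.Threading.Tasks;

    class BackgroundFetchDemo
    {
        // Stand-in for the slow data source, simulated with a delay.
        static async Task<string[]> FetchRecordsAsync()
        {
            await Task.Delay(3000);
            return new[] { "row 1", "row 2" };
        }

        static async Task Main()
        {
            // Kick off the slow fetch without blocking the rest of the program.
            Task<string[]> fetch = FetchRecordsAsync();

            // Do other meaningful work while the fetch is in flight.
            Console.WriteLine("Doing other work...");

            // Only wait at the point where the data is actually needed.
            string[] rows = await fetch;
            Console.WriteLine("Got " + rows.Length + " rows");
        }
    }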
Time spent because you are overflowing memory with gigabytes of data
Use a different algorithm that works on a smaller dataset at a time.
Try to optimize cache usage.
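To illustrate the cache point, a small sketch that does exactly the same summation twice; only the traversal order changes, and on a typical machine the cache-hostile order is several times slower (the 2048 size is arbitrary):

    using System;
    using System.Diagnostics;

    class CacheDemo
    {
        static void Main()
        {
            const int n = 2048;
            var data = new double[n, n];

            // Cache-friendly: walk the array in the order it is laid out in
            // memory (row by row for a rectangular C# array).
            var sw = Stopwatch.StartNew();
            double sum = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    sum += data[i, j];
            Console.WriteLine("row-major:    " + sw.ElapsedMilliseconds + " ms");

            // Cache-hostile: same work, but every access jumps a whole row
            // ahead, so most reads miss the cache.
            sw.Restart();
            sum = 0;
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    sum += data[i, j];
            Console.WriteLine("column-major: " + sw.ElapsedMilliseconds + " ms");
        }
    }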
The answer to efficient coding is to measure where your running time goes
Use a profiler.
see: http://csharp-source.net/open-source/profilers
And optimize those parts that eat more than 50% of your CPU time.
Do this for a number of iterations, and soon your 10 hour running time will be down to a manageable 3 minutes, instead of the 9.5 hours that you will get from switching to this or that better language.
Related
Back in 2009 I posted this answer to a question about optimisations for nested try/catch/finally blocks.
Thinking about this again some years later, it seems the question could be extended to other control flow: not only try/catch/finally, but also if/else.
At each of these junctions, execution will follow one path. Code must be generated for both, obviously, but the order in which they're placed in memory, and the number of jumps required to navigate through them will differ.
The order generated code is laid out in memory has implications for the miss rate on the CPU's instruction cache. Having the instruction pipeline stalled, waiting for memory reads, can really kill loop performance.
I don't think loops (for/foreach/while) are such a good fit unless you expect the loop to have zero iterations more often than it has some, as the natural generation order seems pretty optimal.
Some questions:
In what ways do the available .NET JITs optimise for generated instruction order?
How much difference can this make in practice to common code? What about perfectly suited cases?
Is there anything the developer can do to influence this layout? What about meddling with the forbidden goto?
Does the specific JIT being used make much difference to layout?
Does the method inlining heuristic come into play here too?
Basically anything interesting related to this aspect of the JIT!
Some initial thoughts:
Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.
For some loops I suspect you can increase performance non-trivially. However in general I don't think it'll make that much difference.
I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond) which you can use to tell the compiler which branch is the common path to optimise for. I'm not sure that all compilers respect these macros.
Instruction ordering is distinct from the problem of branch prediction, in which the CPU guesses (on its own, afaik) which branch will be taken in order to start the pipeline (oversimplified steps: decode, fetch operands, execute, write back) on instructions, before the execute step has determined the value of the condition variable.
I can't think of any way to influence this order in the C# language. Perhaps you can manipulate it a bit by jumping to labels explicitly with goto, but is this portable, and are there any other problems with it?
Perhaps this is what profile guided optimisation is for. Do we have that in the .NET ecosystem, now or in plan? Maybe I'll go and have a read about LLILC.
The optimization you are referring to is called the code layout optimization which is defined as follows:
Those pieces of code that are executed close in time in the same thread should be close in the virtual address space so that they fit in a single or few consecutive cache lines. This reduces cache misses.
Those pieces of code that are executed close in time in different threads should be close in the virtual address space so that they fit in a single or few consecutive cache lines, as long as there is no self-modifying code. This gets lower priority than the previous one. This reduces cache misses.
Those pieces of code that are executed frequently (hot code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
Those pieces of code that are rarely executed (cold code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
Now to your questions.
In what ways do the available .NET JITs optimise for generated instruction order?
"Instruction order" is really a very general term. Many optimizations affect instruction order. I'll assume that you're referring to code layout.
JITters by design should take the minimum amount of time to compile code while at the same time produce high-quality code. To achieve this, they only perform the most important optimizations so that it's really worth spending time doing them. Code layout optimization is not one of them because without profiling, it may not be beneficial. While a JITter can certainly perform profiling and dynamic optimization, there is a generally preferred way.
How much difference can this make in practice to common code? What about perfectly suited cases?
Code layout optimization by itself can improve overall performance typically by -1% (negative one) to 4%, which is enough to make compiler writers happy. I would like to add that it reduces energy consumption indirectly by reducing cache misses. The reduction in miss ratio of the instruction cache can be typically up to 35%.
Is there anything the developer can do to influence this layout? What about meddling with the forbidden goto?
Yes, there are numerous ways. I would like to mention the generally recommended one which is mpgo.exe. Please do not use goto for this purpose. It's forbidden.
Does the specific JIT being used make much difference to layout?
No.
Does the method inlining heuristic come into play here too?
Inlining can indeed improve code layout with respect to function calls. It's one of the most important optimizations and all .NET JITs perform it.
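If you want to nudge that heuristic yourself, the closest thing I know of that the developer gets is MethodImplOptions; a small sketch (the JIT is free to ignore the hint, and NoInlining is mostly useful for keeping cold code out of the hot path):

    using System.Runtime.CompilerServices;

    static class InliningHints
    {
        // A hint that this small helper is worth inlining into its callers.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static double Square(double x) => x * x;

        // Conversely, keep a rarely executed method out of line.
        [MethodImpl(MethodImplOptions.NoInlining)]
        public static void LogRareEvent(string message) =>
            System.Console.Error.WriteLine(message);
    }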
Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.
Yes, it might be "easy", but what is the potential benefit? catch blocks are typically small in size (containing a call to a function that handles the exception). Handling this particular case of code layout does not seem promising. If you really care, use mpgo.exe.
I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond) which you can use to tell the compiler which branch is the common path to optimise for.
Using PGO is much more preferable over using likely(cond) and unlikely(cond) for two reasons:
The programmer might inadvertently make mistakes while placing likely(cond) and unlikely(cond) in the code. It actually happens a lot. Making big mistakes while trying to manually optimize the code is very typical.
Adding likely(cond) and unlikely(cond) all over the code makes it less maintainable in the future. You'll have to make sure that these hints hold every time you change the source code. In large code bases, this could be (or rather is) a nightmare.
Instruction ordering is distinct from the problem of branch prediction...
Assuming you are talking about code layout, yes they are distinct. But code layout optimization is usually guided by a profile which really includes branch statistics. Hardware branch prediction is of course totally different.
Maybe I'll go and have a read about LLILC.
While using mpgo.exe is the mainstream way of performing this optimization, you can also use LLILC, since LLVM supports profile-guided optimization as well. But I don't think you need to go this far.
I am building a prototype for a quantitative library that does some signal analysis using image processing techniques. I built the initial prototype entirely in C#, but the performance is not as good as expected. Most of the computation is done through heavy matrix calculations, and these are taking up most of the time.
I am wondering if it is worth it to write a C++/CLI interface to unmanaged C++ code. Has anyone ever gone through this? Other suggestions for optimizing C# performance are welcome.
There was a time where it would definitely be better to write in C/C++, but the C# optimizer and JIT is so good now, that for pure math, there's probably no difference.
The difference comes when you have to deal with memory and possibly arrays. Even so, I'd still work with C# (or F#) and then optimize hotspots. The JIT is really good at optimizing away small, short-lived objects.
With arrays, you have to worry about C# doing bounds checks on each access.
Test it yourself -- I've been finding C# to be comparable -- sometimes faster.
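The usual way to keep those bounds checks cheap is to let the JIT prove them away; a sketch of the pattern (exactly when the check is elided depends on the JIT version):

    static class BoundsCheckDemo
    {
        // Looping directly against values.Length lets the JIT prove the index
        // is always in range, so it can typically drop the per-access check.
        public static double Sum(double[] values)
        {
            double sum = 0;
            for (int i = 0; i < values.Length; i++)
                sum += values[i];
            return sum;
        }

        // With an arbitrary bound the JIT generally cannot prove safety, so
        // each values[i] keeps its bounds check.
        public static double SumFirst(double[] values, int count)
        {
            double sum = 0;
            for (int i = 0; i < count; i++)
                sum += values[i];
            return sum;
        }
    }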
It's hard to give a definitive answer here, but if performance is an issue I'd find a time-tested library with the performance you need and wrap it.
Something simple like multiplication or division is not much different between C++ and C# - the C++ compiler has an optimizer, and the CLR runtime has the on-demand JITter that does optimizations. So in theory, the C++ would outperform C# only on the first call.
However, theory and practice are not the same. With more complicated algorithms you also run into the differences between memory managers, and the maturity of the optimization techniques. If you want anecdotal evidence, you can find some math-heavy comparisons here.
Personally, I find doing the heavy computations in a native library and using c++/CLI to call it gives a good boost when the computations are the biggest bottleneck. As always, make sure that's the case before doing any optimization.
Matrix math is best done in native code in my opinion. Even the C++ libraries typically allow binding to a lower-level implementation like LAPACK.
There is a C# LAPACK port here (also C# BLAS on the same site) which you could try but I'd be surprised if this is faster than native code.
I've done a lot of image processing work in C# and, yes, I usually do use native code for heavy-duty code where performance matters, but I used just P/Invokes and not the C++/CLI interface. A lot of the time this is not needed, though.
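For reference, the P/Invoke route is just a DllImport declaration; a minimal sketch where fastmath.dll and multiply_matrices are hypothetical placeholders for whatever native library you actually build:

    using System.Runtime.InteropServices;

    static class NativeMath
    {
        // Hypothetical native routine; the arrays are blittable, so they are
        // pinned and passed to the native side as raw double* pointers.
        [DllImport("fastmath.dll", CallingConvention = CallingConvention.Cdecl)]
        private static extern void multiply_matrices(
            double[] a, double[] b, double[] result, int n);

        public static double[] Multiply(double[] a, double[] b, int n)
        {
            var result = new double[n * n];
            multiply_matrices(a, b, result, n);
            return result;
        }
    }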
There are quite a few good .NET profilers. The Red Gate one is my personal favorite. It might help you to visualize where the bottlenecks are.
The only reasonable language benchmark out there: http://shootout.alioth.debian.org/
See for yourself.
Performance for mathematical computation is pretty poor in C#. I was gobsmacked to find how slow mathematical calculations are in C#. Just write a loop in C# and C++ doing a few multiplications, Sin, Cos, ... and the difference is immense.
I do not know Managed C++, but implementing it all in unmanaged C++ and, I would imagine, exposing granular interfaces through P/Invoke should have little performance hit.
That is what I have done for heavy real-time image processing.
I built the initial prototype entirely in C#, but the performance is not as good as expected.
then you have two options:
Build another prototype in C++, and see how that compares, or optimize your C# code. No matter what language you write in, your code won't be fast until you've profiled and optimized and profiled and optimized it. That is especially true in C++. If you write the fastest possible implementation in C# and compare it to the fastest possible implementation in C++, then the C++ version will most likely be faster. But it will come at a cost in terms of development time. It's not trivial to write efficient C++ code. If you are new to the language then you will most likely write very inefficient code, especially if you are coming from C# or Java, where things are done differently and have different costs.
If you just write a working implementation, without worrying too much about performance, then I'm guessing that the C# version will probably be faster.
But it really depends on what kind of performance you're after (and, not least, how expensive the operations you need to perform are). There's an overhead associated with the transition from managed to native code, so it is not worth it for short operations that are executed often.
Number-crunching code in C++ can be as fast as code written in Fortran (give or take a few percent), but to achieve that, you need to use a lot of advanced techniques (expression templates and lots of metaprogramming) or some fairly complex libraries which implement it for you.
Is that worth it? Or can C# be made fast enough for your needs?
You should write computationally heavy programs in C++; you cannot get anywhere near C++ performance by optimizing C#. The overhead of calling wrappers is negligible assuming the computation takes considerable time. I have done coding in both C++ and C# and have never seen any occasion where .NET Framework code came out comparable to C++. There are a few instances where C# runs better, but it was better because of a lack of appropriate libraries or bad coding in C++. If you can write code equally well in C# and C++, I would write performance code in C++ and everything else in C#.
If x is the world's best C++ programmer and y is the best C# programmer, then most of the time x can write faster code than y. However, y can finish the coding faster than x most of the time.
I'm preparing to write a photonic simulation package that will run on a 128-node Linux and Windows cluster, with a Windows-based client for designing jobs (CAD-like) and submitting them to the cluster.
Most of this is well-trod ground, but I'm curious how C# stacks up to C++ in terms of real number-crunching ability. I'm very comfortable with both languages, but I find the superior object model and framework support of C# with .NET or Mono incredibly enticing. However, I can't, with this application, sacrifice too much in processing power for the sake of developer preference.
Does anyone have any experience in this area? Are there any hard benchmarks available? I'd assume that the final machine code would be optimized using the same techniques whether it comes from a C# or C++ source, especially since that typically takes place at the pcode/IL level.
The optimisation techniques employed by C# and native C++ are vastly different. C# compilers emit IL, which is only marginally optimised and then JIT'ed to binary code when it is about to execute for the first time. Most of the optimisation work happens inside the JIT compiler.
This has pros and cons. JIT has time budgets, which limits how much effort it can expend on optimisation. But it also has intimate knowledge of the hardware it is actually running on, so it can (in theory) make transparent use of newer CPU opcodes and detailed knowledge of performance data such as a pipeline hazards database.
In practice, I don't know how significant the latter is. I do know that at least Mono will parallelise some loops automatically if it finds itself running on a CPU with SSE (SSE2, perhaps?), which may be a big deal for your scenario.
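As an aside, newer versions of .NET and Mono expose SIMD explicitly through System.Numerics.Vector<T> (a different API from the Mono-specific support alluded to above); a sketch of what that looks like:

    using System.Numerics;

    static class SimdDemo
    {
        // Adds two arrays element-wise, Vector<float>.Count lanes at a time
        // (4 or 8 floats on typical SSE/AVX hardware).
        public static void Add(float[] a, float[] b, float[] result)
        {
            int i = 0;
            int width = Vector<float>.Count;

            for (; i <= a.Length - width; i += width)
            {
                var va = new Vector<float>(a, i);
                var vb = new Vector<float>(b, i);
                (va + vb).CopyTo(result, i);
            }

            // Scalar tail for the leftover elements.
            for (; i < a.Length; i++)
                result[i] = a[i] + b[i];
        }
    }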
I did a quick search and found this:
http://www.drdobbs.com/184401976;jsessionid=232QX0GU3C3KXQE1GHOSKH4ATMY32JVN
Edit: Bear in mind (on reading the article) that this was done 5 years ago so performance is likely to be better all round!
I have found these performance comparisons:
http://reverseblade.blogspot.com/2009/02/c-versus-c-versus-java-performance.html
http://systematicgaming.wordpress.com/2009/01/03/performance-c-vs-c/
http://journal.stuffwithstuff.com/2009/01/03/debunking-c-vs-c-performance/
http://www.csharphelp.com/2007/01/managed-c-versus-unmanaged-c/
And here is even a case study:
http://www.itu.dk/~sestoft/papers/numericperformance.pdf
Hope it helps.
If I understand this correctly:
Current CPU-developing companies like AMD and Intel have their own API codes (the assembly language), which they see as the 2G language on top of the machine code (1G language).
Would it be possible or desirable (for performance or otherwise) to have a CPU that would perform IL handling at its core instead of the current API calls?
A similar technology does exist for Java - ARM do a range of CPUs that can do this, they call it their "Jazelle" technology.
However, the operations represented by .net IL opcodes are only well-defined in combination with the type information held on the stack, not on their own. This is a major difference from Java bytecode, and would make it much more difficult to create sensible hardware to execute IL.
Moreover, IL is intended for compilation to a final target. Most back ends that spit out IL do very little optimisation, aiming instead to preserve semantic content for verification and optimisation in the final compilation step. Even if the hardware problems could be overcome, the result will almost certainly still be slower than a decent optimising JIT.
So, to sum up: while it is not impossible, it would be disproportionately hard compared to other architectures, and would achieve little.
You seem a bit confused about how CPUs work. Assembly is not a separate language from machine code. It is simply a different (textual) representation of it.
Assembly code is simply a sequential listing of instructions to be executed. And machine code is exactly the same thing. Every instruction supported by the CPU has a certain bit-pattern that cause it to be executed, and it also has a textual name you can use in assembly code.
If I write add $10, $9, $8 and run it through an assembler, I get the machine code for the add instruction, taking the values in registers 9 and 8, adding them and storing the result in register 10.
There is a 1 to 1 mapping between assembler and machine code.
There also are no "API calls". The CPU simply reads from address X, and matches the subsequent bits against all the instructions it understands. Once it finds an instruction that matches this bit pattern, it executes the instruction, and moves on to read the next one.
What you're asking is in a sense impossible or a contradiction. IL stands for Intermediate Language, that is, a kind of pseudocode that is emitted by the compiler, but has not yet been translated into machine code. But if the CPU could execute that directly, then it would no longer be intermediate, it would be machine code.
So the question becomes "is your IL code a better, more efficient representation of a program, than the machine code the CPU supports now?"
And the answer is most likely no. MSIL (I assume that's what you mean by IL, which is a much more general term) is designed to be portable, simple and consistent. Every .NET language compiles to MSIL, and every MSIL program must be able to be translated into machine code for any CPU anywhere. That means MSIL must be general and abstract and not make assumptions about the CPU. For this reason, as far as I know, it is a purely stack-based architecture. Instead of keeping data in registers, each instruction processes the data on the top of the stack. That's a nice clean and generic system, but it's not very efficient, and doesn't translate well to the rigid structure of a CPU. (In your wonderful little high-level world, you can pretend that the stack can grow freely. For the CPU to get fast access to it, it must be stored in some small, fast on-chip memory with finite size. So what happens if your program pushes too much data on the stack?)
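To make the stack-based point concrete, here is a trivial C# method together with (approximately) the IL the compiler emits for it; note that nothing in the IL mentions registers:

    static class StackMachineDemo
    {
        // A trivial method...
        static int MultiplyAdd(int a, int b, int c) => a * b + c;

        // ...compiles to roughly this stack-based IL (approximate, for
        // illustration only):
        //
        //   ldarg.0   // push a
        //   ldarg.1   // push b
        //   mul       // pop a and b, push a * b
        //   ldarg.2   // push c
        //   add       // pop two values, push the sum
        //   ret       // return the top of the stack
        //
        // Every operation works on the evaluation stack; a real CPU would have
        // to map all of this onto its register file and a fixed-size stack.
    }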
Yes, you could make a CPU to execute MSIL directly, but what would you gain?
You'd no longer need to JIT code before execution, so the first time you start a program, it would launch a bit faster. Apart from that, though? Once your MSIL program has been JIT'ed, it has been translated to machine code and runs as efficiently as if it had been written in machine code originally. MSIL bytecode no longer exists, just a series of instructions understood by the CPU.
In fact, you'd be back where you were before .NET. Non-managed languages are compiled straight to machine code, just like this would be in your suggestion. The only difference is that non-managed code targets machine code that is designed by CPU designers to be suitable for execution on a CPU, while in your case, it'd target machine code that's designed by software designers to be easy to translate to and from.
This is not a new idea - the same thing was predicted for Java, and Lisp machines were even actually implemented.
But experience with those shows that it's not really useful - by designing special-purpose CPUs, you can't benefit from the advances of "traditional" CPUs, and you very likely can't beat Intel at their own game. A quote from the Wikipedia article illustrates this nicely:
cheaper desktop PCs soon were able to run Lisp programs even faster than Lisp machines, without the use of special purpose hardware.
Translating from one kind of machine code to another on the fly is a well-understood problem and so common (modern CISC CPUs even do something like that internally because they're really RISC) that I believe we can assume it is being done efficiently enough that avoiding it does not yield significant benefits - not when it means you have to decouple yourself from the state of the art in traditional CPUs.
I would say no.
The actual machine language instructions that need to run on a computer are lower level than IL. IL, for example, doesn't really describe how methods calls should be made, how registers should be managed, how the stack should be accessed, or any other of the details that are needed at the machine code level.
Getting the machine to recognize IL directly would, therefore, simply move all the JIT compilation logic from software into hardware.
That would make the whole process very rigid and unchangeable.
By having the machine language based on the capabilities of the machine, and an intermediate language based on capturing programmer intent, you get a much better system. The folks defining the machine can concentrate on defining an efficient computer architecture, and the folks defining the IL system can focus on things like expressiveness and safety.
If both the tool vendors and the hardware vendors had to use the exact same representation for everything, then innovation in either the hardware space or the tool space would be hampered. So, I say they should be separate from one another.
I wouldn't have thought so, for two reasons:
If you had hardware processing IL that hardware would not be able to run a newer version of IL. With JIT you just need a new JIT then existing hardware can run the newer IL.
IL is simple and is designed to be hardware-agnostic. The IL would have to become much more complex to enable it to describe operations in the most efficient manner in order to get anywhere close to the performance of existing machine code. But that would mean the IL would be much harder to run on non-IL-specific hardware.
I want to create a simple HTTP proxy server that does some very basic processing on the HTTP headers (i.e. if header x == y, do z). The server may need to support hundreds of users. I can write the server in C# (pretty easy) or C++ (much harder). However, would a C# version have as good performance as a C++ version? If not, would the difference in performance be big enough that it would not make sense to write it in C#?
You can use unsafe C# code and pointers in critical bottleneck points to make it run faster. Those behave much like C++ code and I believe it executes as fast.
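For illustration, a minimal sketch of that kind of unsafe hot spot (needs the /unsafe compiler switch):

    static class UnsafeSumDemo
    {
        // The fixed statement pins the array so the GC cannot move it while
        // we walk it with a raw pointer, which also avoids bounds checks.
        public static unsafe long Sum(int[] values)
        {
            long sum = 0;
            fixed (int* start = values)
            {
                int* p = start;
                int* end = start + values.Length;
                while (p < end)
                    sum += *p++;
            }
            return sum;
        }
    }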
But most of the time, C# is JIT-ted to uber-fast already, I don't believe there will be much differences as with what everyone has said.
But one thing you might want to consider is: Managed code (C#) string operations are rather slow compared to using pointers effectively in C++. There are more optimization tricks with C++ pointers than with CLR strings.
I think I have done some benchmarks before, but can't remember where I've put them.
Why do you expect a much higher performance from the C++ application?
There is no inherent slowdown added by a C# application when you are doing it right (not too many dropped references, no frequent object creation/dropping per call, etc.).
The only time a C++ application really outperforms an equivalent C# application is when you can do (very) low level operations. E.g. casting raw memory pointers, inline assembler, etc.
The C++ compiler may be better at creating fast code, but mostly this is wasted in most applications. If you really do have a part of your application that must be blindingly fast, try writing a C call for that hot spot.
Only if most of the system behaves too slowly should you consider writing it in C/C++. But there are many pitfalls that may kill your performance in your C++ code.
(TL;DR: A C++ expert may create 'faster' code than a C# expert, but a mediocre C++ programmer may create slower code than a mediocre C# one.)
I would expect the C# version to be nearly as fast as the C++ one but with smaller memory footprint.
In some cases managed code is actually a LOT faster and uses less memory compared to non-optimized C++. C++ code can be faster if written by an expert, but it rarely justifies the effort.
As a side note I can recall a performance "competition" in the blogosphere between Michael Kaplan (C#) and Raymond Chen (C++) to write a program that does exactly the same thing. Raymond Chen, who is considered one of the best programmers in the world (according to Joel), only succeeded in writing faster C++ after a long struggle, rewriting most of the code.
The proxy server you describe would deal mostly with string data and I think it's reasonable to implement in C#. In your example,
if header x == y, do z
the slowest part might actually be doing whatever 'z' is and you'll have to do that work regardless of the language.
In my experience, the design and implementation have much more to do with performance than does the choice of language/framework (however, the usual caveats apply: e.g., don't write a device driver in C# or Java).
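For what it's worth, the "if header x == y, do z" part really is only a few lines of C#. Here is a minimal sketch built on HttpListener; the header name, its value and the port are placeholders, and the actual forwarding to the upstream server is left out:

    using System;
    using System.Net;
    using System.Text;

    class HeaderCheckDemo
    {
        static void Main()
        {
            // Depending on the prefix and OS settings, listening may require
            // admin rights or a URL ACL reservation.
            var listener = new HttpListener();
            listener.Prefixes.Add("http://localhost:8080/");
            listener.Start();

            while (true)
            {
                HttpListenerContext context = listener.GetContext();

                // "if header x == y, do z" -- placeholder rule.
                string clientId = context.Request.Headers["X-Client-Id"];
                if (clientId == "beta-tester")
                {
                    context.Response.AddHeader("X-Routed-To", "beta-backend");
                }

                // Real proxying (forwarding the request upstream) would go
                // here; this sketch just replies directly.
                byte[] body = Encoding.UTF8.GetBytes("OK");
                context.Response.OutputStream.Write(body, 0, body.Length);
                context.Response.Close();
            }
        }
    }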
I wouldn't think twice about writing the type of program you describe in a managed language (be it Java, C#, etc). These days, the performance gains you get from using a lower-level language (in terms of closeness to hardware) are often easily offset by the runtime abilities of a managed environment. Of course this is coming from a C#/Python developer, so I'm not exactly unbiased...
If you need a fast and reliable proxy server, it might make sense to try some of those that already exist. But if you have custom features that are required, then you may have to build your own. You may want to collect some more information on the expected load: hundreds of users might be a few requests a minute or a hundred requests a second.
Assuming you need to serve under or around 200 qps on a single machine, C# should easily meet your needs -- even languages known for being slow (e.g. Ruby) can easily pump out a few hundred requests a second.
Aside from performance, there are other reasons to choose C#; e.g., it's much easier to introduce buffer overflows in C++ than in C#.
Is your http server going to run on a dedicated machine? If yes, I would say go with C# if it is easier for you. If you need to run other applications on the same machine, you'll need to take into account the memory footprint of your application and the fact that GC will run at "random" times.