"Imprecise faults" and SIMD - c#

I'm looking through the CIL Spec. In an appendix, it talks about "Imprecise faults", meaning that a user could specify that the exact order of null reference exceptions, etc. could be relaxed. The appendix talks about various ways in which this could be used by the JITer to improve performance.
One specific subsection that caught my eye:
F.5.2 Vectorizing a loop
Vectorizing a loop usually requires knowing two things:
1. The loop iterations are independent.
2. The number of loop iterations is known.
In a method that is not relaxed for the checks that might fault, part 1 is
frequently false, because the possibility of a fault induces a control
dependence from each loop iteration to succeeding loop iterations. In
a relaxed method, those control dependences can be ignored. In most
cases, relaxed methods simplify vectorization by allowing checks to be
hoisted out of a loop. Nevertheless, even when such hoisting is not
possible, ignoring cross-iteration dependences implied by faults can
be crucial to vectorization for “short vector” SIMD hardware such as
IA-32 SSE or PowerPC Altivec.
For example, consider this loop:
for (k = 0; k < n; k++) {
    x[k] = x[k] + y[k] * s[k].a;
}
where s is an array of references. The checks for null references
cannot be hoisted out of the loop, even in a relaxed context. But
relaxed does allow “unroll-and-jam” to be applied successfully. The
loop can be unrolled by a factor of 4 to create aggregate iterations,
and the checks hoisted to the top of each aggregate iteration.
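To make that concrete, my reading is that the unroll-and-jam result would look roughly like the following (my own hand-written sketch in C#, reusing the spec's variable names; a JIT would of course emit SIMD instructions rather than scalar statements):
int k = 0;
for (; k + 4 <= n; k += 4) {
    // Null checks hoisted to the top of the aggregate iteration of 4.
    if (s[k] == null || s[k + 1] == null || s[k + 2] == null || s[k + 3] == null)
        throw new System.NullReferenceException();

    // These four statements are now independent and can map onto one SIMD operation.
    x[k]     = x[k]     + y[k]     * s[k].a;
    x[k + 1] = x[k + 1] + y[k + 1] * s[k + 1].a;
    x[k + 2] = x[k + 2] + y[k + 2] * s[k + 2].a;
    x[k + 3] = x[k + 3] + y[k + 3] * s[k + 3].a;
}
for (; k < n; k++) {
    // Remainder iterations stay scalar.
    x[k] = x[k] + y[k] * s[k].a;
}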
That is, it's suggesting that the loop could be automatically turned into SIMD operations by the JITer if it were compiled under these relaxed faults. The spec suggests that you can request these relaxations via the System.Runtime.CompilerServices.CompilationRelaxations enum. But in actual C#, the enum only has the NoStringInterning option and none of the others. I've tried setting the System.Runtime.CompilerServices.CompilationRelaxationsAttribute by hand to some int codes pulled from other sources, but there was no difference in the x86 assembly produced.
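For the record, the kind of thing I tried looks like this; only the NoStringInterning bit (0x0008) is actually documented, so the extra value below is just a placeholder for the undocumented codes I experimented with:
using System.Runtime.CompilerServices;

// 0x0008 is CompilationRelaxations.NoStringInterning; the 0x0004 bit is a
// made-up placeholder for the undocumented relaxation codes pulled from elsewhere.
[assembly: CompilationRelaxations(0x0008 | 0x0004)]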
So as far as I can tell the official Microsoft JIT does not implement this. And I know Mono has the Mono.Simd namespace, so my guess is it doesn't implement this, either.
So I'm curious if there's some piece of history about that appendix (and section 12.6.4 "Optimization", which talks about this, too) that I'm missing. Why is it in the standard if neither major vendor actually implements it? Are there plans from Microsoft to work on it in the future?

So I'm curious if there's some piece of history about that appendix (and section 12.6.4 "Optimization", which talks about this, too) that I'm missing. Why is it in the standard if neither major vendor actually implements it? Are there plans from Microsoft to work on it in the future?
I suspect this was put in the specifically to provide the option to allow this to be implemented at some point without breaking the implementation or requiring a specification change.
But in actual C# the enum only has the NoStringInterning option without any of the others
This is because NoStringInterning is the only supported option at this time. Since an enum in C# is extensible (it's just an underlying integer type), a future version of the runtime could easily extend it to support other options.
Note that there are suggestions on the VS UserVoice site for Microsoft to make improvements in this area.

Such are the burdens of the guy who has to write the CLI spec: he doesn't yet know whether actually implementing this in a jitter is practical. That happens later.
SIMD is a problem; it has a pretty hard variable-alignment requirement. At least around the time the x86 jitter was written, trying to apply a SIMD instruction to a mis-aligned variable produced a hard bus fault. I'm not so sure what the state of the art was when the x64 jitter was written, but today it is still very expensive. The x86 jitter can't do better than 4-byte alignment, and x64 can't do better than 8. It might take a next-generation 128-bit core to get the 16-byte alignment needed to really make this effective. I'm not holding my breath for that :)

Related

Is it possible to use branch prediction hinting in C#?

For example, I know it is defined for gcc and used in the Linux kernel as:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
If nothing like this is possible in C#, is the best alternative to manually reorder if-statements, putting the most likely case first? Are there any other ways to optimize based on this type of external knowledge?
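For instance, the kind of manual reordering I have in mind would be something like this (the method and names are made up purely for illustration):
// Put the case expected to be common first so the hot path is the straight
// fall-through; the rare/error path goes last.
static int FirstByte(byte[] packet) {
    if (packet != null && packet.Length > 0) {   // "likely" case first
        return packet[0];                        // hot path
    }
    // "unlikely" case last
    throw new System.ArgumentException("empty packet");
}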
On a related note, the CLR knows how to identify guard clauses and assumes that the alternate branch will be taken, making this optimization inappropriate to use on guard clauses, correct?
(Note that I realize this may be a micro-optimization; I'm only interested for academic purposes.)
Short answer: No.
Longer answer: You don't really need to in most cases. You can give hints by changing the logic in your statements. This is easier to do with a performance tool, like the one built into the higher (and more expensive) versions of Visual Studio, since you can capture the mispredicted-branches counter. I realize this is for academic purposes, but it's good to know that the JITer is very good at optimizing your code for you. As an example (taken pretty much verbatim from CLR via C#):
This code:
public static void Main() {
    Int32[] a = new Int32[5];
    for (Int32 index = 0; index < a.Length; index++) {
        // Do something with a[index]
    }
}
may seem to be inefficient, since a.Length is a property and as we know in C#, a property is actually a set of one or two methods (get_XXX and set_XXX). However, the JIT knows that it's a property and either stores the length in a local variable for you, or inlines the method, to prevent the overhead.
...some developers have underestimated the abilities of the JIT compiler and have tried to write "clever code" in an attempt to help the JIT compiler. However, any clever attempts that you come up with will almost certainly impact performance negatively and make your code harder to read, reducing its maintainability.
Among other things, it actually goes further and performs the bounds check once, outside the loop, instead of on every iteration inside the loop, which would degrade performance.
I realize this has little to do directly with your question, but I guess the point I'm trying to make is that micro-optimizations like this don't really help you much in C#, because the JIT generally does it better, as it was designed exactly for this. (Fun fact: the x86 JIT compiler performs more aggressive optimizations than its x64 counterpart.)
This article explains some of the optimizations that were added in .NET 3.5 SP1, among them being improvements to straightening branches to improve prediction and cache locality.
All of that being said, if you want to read a great book that goes into what the compiler generates and performance of the CLR, I recommend the book that I quoted from above, CLR via C#.
EDIT: I should mention that if this were currently possible in .NET, you could find the information in either the ECMA-335 standard or a working draft. There is no standard that supports this, and viewing the metadata in something like ILDasm or CFF Explorer shows no signs of any special metadata that could hint at branch predictions.

In C#, why is "int" an alias for System.Int32?

Since C# supports Int8, Int16, Int32 and Int64, why did the designers of the language choose to define int as an alias for Int32 instead of allowing it to vary depending on what the native architecture considers to be a word?
I have not had any specific need for int to behave differently than the way it does, I am only asking out of pure encyclopedic interest.
I would think that a 64-bit RISC architecture could conceivably exist which would most efficiently support only 64-bit quantities, and in which manipulations of 32-bit quantities would require extra operations. Such an architecture would be at a disadvantage in a world in which programs insist on using 32-bit integers, which is another way of saying that C#, becoming the language of the future and all, essentially prevents hardware designers from ever coming up with such an architecture in the future.
StackOverflow does not encourage speculative answers, so please answer only if your information comes from a dependable source. I have noticed that some members of SO are Microsoft insiders, so I was hoping that they might be able to enlighten us on this subject.
Note 1: I did in fact read all answers and all comments of SO: Is it safe to assume an int will always be 32 bits in C#? but did not find any hint as to the why that I am asking in this question.
Note 2: the viability of this question on SO is (inconclusively) discussed here: Meta: Can I ask a “why did they do it this way” type of question?
I believe that their main reason was portability of programs targeting CLR. If they were to allow a type as basic as int to be platform-dependent, making portable programs for CLR would become a lot more difficult. Proliferation of typedef-ed integral types in platform-neutral C/C++ code to cover the use of built-in int is an indirect hint as to why the designers of CLR decided on making built-in types platform-independent. Discrepancies like that are a big inhibitor to the "write once, run anywhere" goal of execution systems based on VMs.
Edit: More often than not, the size of an int plays into your code implicitly through bit operations rather than through arithmetic (after all, what could possibly go wrong with i++, right?). But the errors are usually more subtle. Consider the example below:
const int MaxItem = 20;
var item = new MyItem[MaxItem];
for (int mask = 1; mask != (1 << MaxItem); mask++) {
    var combination = new HashSet<MyItem>();
    for (int i = 0; i != MaxItem; i++) {
        if ((mask & (1 << i)) != 0) {
            combination.Add(item[i]);
        }
    }
    ProcessCombination(combination);
}
This code computes and processes all combinations of 20 items. As you can tell, the code fails miserably on a system with 16-bit int, but works fine with ints of 32 or 64 bits.
Unsafe code would provide another source of headaches: when int is fixed at some size (say, 32 bits), code that allocates 4 bytes for every int it needs to marshal will work, even though it is technically incorrect to use 4 in place of sizeof(int). Moreover, this technically incorrect code remains portable!
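A sketch of what I mean, using Marshal.AllocHGlobal purely as an example of such an allocation:
int count = 10;

// Technically incorrect, yet "portable" because int is pinned to 32 bits:
System.IntPtr buffer = System.Runtime.InteropServices.Marshal.AllocHGlobal(count * 4);

// The correct, size-agnostic version:
System.IntPtr buffer2 = System.Runtime.InteropServices.Marshal.AllocHGlobal(count * sizeof(int));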
Ultimately, small things like that play heavily into the perception of a platform as "good" or "bad". Users of .NET programs do not care whether a program crashes because its programmer made a non-portable mistake or because the CLR is buggy. This is similar to the way early versions of Windows were widely perceived as unstable due to the poor quality of drivers. To most users, a crash is just another .NET program crash, not a programmer's issue. Therefore it is good for the perception of the ".NET ecosystem" to make the standard as forgiving as possible.
Many programmers have a tendency to write code for the platform they use. This includes assumptions about the size of a type. There are many C programs around which will fail if the size of an int were changed to 16 or 64 bits, because they were written under the assumption that an int is 32 bits. The choice made for C# avoids that problem by simply defining it that way. If you define int as variable depending on the platform, you buy back into that same problem. Although you could argue that it's the programmer's fault for making wrong assumptions, it makes the language a bit more robust (IMO). And for desktop platforms, a 32-bit int is probably the most common occurrence. Besides, it makes porting native C code to C# a bit easier.
Edit: I think you write code which makes (implicit) assumptions about the size of a type more often than you think. Basically anything which involves serialization (like .NET Remoting, WCF, serializing data to disk, etc.) will get you in trouble if you allow variable sizes for int, unless the programmer takes care of it by using a specifically sized type like Int32. And then you end up with "we'll always use Int32 anyway, just in case", and you have gained nothing.
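For example, any code along these lines bakes the current size of int into the on-disk format; BinaryWriter here is just one illustration:
using (var stream = System.IO.File.Create("data.bin"))
using (var writer = new System.IO.BinaryWriter(stream)) {
    // Writes sizeof(int) bytes -- always 4 today. If int varied by platform,
    // the file format would silently change from machine to machine.
    writer.Write(42);
}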

Get Calling Object's HashCode?

This might be a duplicate, but I haven't seen this exact question or a similar one asked/answered with a date newer than the release of .Net 4.
I'm looking for a temporary hack that allows me to look through the call stack and get all calling objects (not methods, but the instances that hold the methods) in the stack. Ultimately I need their hashcodes.
Is this possible?
EDIT:
Whether it came across in my question or not, I was really asking if there was a simple/built-in way to do this. Really, it's just a stop-gap fix until I can make breaking changes to other parts of the system. Thanks for the great answers. After seeing them, I think I'll wait . . . :)
What are you trying to achieve here?
Have a look at a similar question I answered about a month ago: How to get current value of EIP in managed code?. You might get some inspiration from that. Or you might decide it is too ugly (+1 for the latter).
If all you want to do is assemble 'unique' call paths within a program session, go right ahead: I'd be very sure to use an AOP weaver and thread local storage. It wouldn't be too hard that way.
Caveat 1: Hashes are not very useful for generic .NET objects
A random object's hashcode may vary with its location on the heap to begin with. For reference: on Mono, with the moving heap allocator disabled, Object::GetHash is this pretty blob of code (mono/metadata/monitor.c):
#else
/*
* Wang's address-based hash function:
* http://www.concentric.net/~Ttwang/tech/addrhash.htm
*/
return (GPOINTER_TO_UINT (obj) >> MONO_OBJECT_ALIGNMENT_SHIFT) * 2654435761u;
#endif
Of course, with the moving allocator things are slightly more complex in order to guarantee a constant hash over the lifetime of the object, but you get the point: each runtime will generate different hashes, and the number of allocations done will alter the future default hash codes of otherwise identical objects.
Caveat 2: Your stack will contain alien frames
Even if you got that part fixed by supplying proper deterministic hash functions, you would require each stack frame to be of a 'recognizable' type. This is probably not going to be the case. Certainly not if you use anything akin to LINQ, anonymous types, static constructors, or delegates; all kinds of things can interleave your stack frames with those of (anonymous) helper types, or even performance trampolines invented by the JIT compiler to optimize tail recursion, a large switch jump table, or code shared between multiple overloads.
Takeaway: stack analysis is hard: you should definitely use the proper API if you are going to undertake it.
Conclusion:
By all means have a ball. But heed the advice
Your requirements are non-standard (underlined by the fact that the runtime library does not support this). This is usually a sign that either you are solving a unique problem (but reconsider the tool chosen?) or you are solving it the wrong way.
You could perhaps get a lot more info by generating a flow graph with some handwritten simulation code instead of trying to hook into the CLR VM
If you're going to do it, use the proper API (probably the profiler API, since a sampling profiler will save exactly this: stack 'fingerprints' every so-many instructions).
If you really must do it by instrumenting your code, consider using AOP
You can get the call stack by creating an instance of the StackTrace class and inspecting the StackFrame objects within it. Looking at the member list, this doesn't seem to reveal the instances, though, just the classes and methods.
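Something along these lines; note that nothing in the frame exposes the calling instance, only the method and its declaring type:
var trace = new System.Diagnostics.StackTrace();
foreach (System.Diagnostics.StackFrame frame in trace.GetFrames()) {
    System.Reflection.MethodBase method = frame.GetMethod();
    System.Console.WriteLine(method.DeclaringType + "." + method.Name);
    // There is no property here that hands you the caller's 'this' reference,
    // so there is no object whose GetHashCode() you could call.
}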
This is possible only with unmanaged APIs, specifically the CLR profiling API. I know nothing about it, other than that it is used to implement profiling and debugging tools. You'll have to google it and be comfortable with burning a week bringing it to production. If at all possible, give up on your plan and find an alternative. Tell us what you want to do and we can help!
Try Environment.StackTrace.

Advantages of compilers for functional languages over compilers for imperative languages

As a follow-up to this question, What are the advantages of built-in immutability of F# over C#? -- am I correct in assuming that the F# compiler can make certain optimizations knowing that it's dealing with largely immutable code? I mean, even if a developer writes "functional C#", the compiler wouldn't know all of the immutability that the developer had tried to code in, so it couldn't make the same optimizations, right?
In general would the compiler of a functional language be able to make optimizations that would not be possible with an imperative language--even one written with as much immutability as possible?
Am I correct in assuming that the F# compiler can make certain
optimizations knowing that it's dealing with largely immutable code?
Unfortunately not. To a compiler writer, there's a huge difference between "largely immutable" and "immutable". Even guaranteed immutability is not that important to the optimizer; the main thing that it buys you is you can write a very aggressive inliner.
In general would the compiler of a functional language be able to make optimizations that would not be possible with an imperative language--even one written with as much immutability as possible?
Yes, but it's mostly a question of being able to apply the classic optimizations more easily, in more places. For example, immutability makes it much easier to apply common-subexpression elimination because immutability can guarantee you that contents of certain memory cells are not changed.
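To illustrate the kind of rewrite meant here (shown in C# syntax purely for familiarity; neither compiler is claimed to do this automatically):
int a = 2, b = 3, c = 4;

// Original: the subexpression a * b appears twice.
int r1 = a * b + c;
int r2 = a * b - c;

// After common-subexpression elimination. The rewrite is only safe if the
// compiler can prove a and b cannot change between the two uses -- which is
// exactly what guaranteed immutability provides for free.
int t  = a * b;
int r3 = t + c;
int r4 = t - c;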
On the other hand, if your functional language is not just immutable but pure (no side effects like I/O), then you enable a new class of optimizations that involve rewriting source-level expressions to more efficient expressions. One of the most important and more interesting to read about is short-cut deforestation, which is a way to avoid allocating memory space for intermediate results. A good example to read about is stream fusion.
If you are compiling a statically typed, functional language for high performance, here are some of the main points of emphasis:
Use memory effectively. When you can, work with "unboxed" values, avoiding allocation and an extra level of indirection to the heap. Stream fusion in particular and other deforestation techniques are all very effective because they eliminate allocations.
Have a super-fast allocator, and amortize heap-exhaustion checks over multiple allocations.
Inline functions effectively. Especially, inline small functions across module boundaries.
Represent first-class functions efficiently, usually through closure conversion. Handle partially applied functions efficiently.
Don't overlook the classic scalar and loop optimizations. They made a huge difference to compilers like TIL and Objective Caml.
If you have a lazy functional language like Haskell or Clean, there are also a lot of specialized things to do with thunks.
Footnotes:
One interesting option you get with total immutability is more ability to execute very fine-grained parallelism. The end of this story has yet to be told.
Writing a good compiler for F# is harder than writing a typical compiler (if there is such a thing) because F# is so heavily constrained: it must do the functional things well, but it must also work effectively within the .NET framework, which was not designed with functional languages in mind. We owe a tip of the hat to Don Syme and his team for doing such a great job on a heavily constrained problem.
No.
The F# compiler makes no attempt to analyze the referential transparency of a method or lambda. The .NET BCL is simply not designed for this.
The F# language specification does reserve the keyword 'pure', so manually marking a method as pure may be possible in vNext, allowing more aggressive graph reduction of lambda-expressions.
However, if you use either record or algebraic types, F# will create default comparison and equality operators, and provide copy semantics. Amongst many other benefits (pattern matching, closed-world assumption), this reduces a significant burden!
Yes, if you don't consider F#, but consider Haskell for instance. The fact that there are no side effects really opens up a lot of possibilities for optimization.
For instance consider in a C like language:
int factorial(int n) {
    if (n <= 0) return 1;
    return n * factorial(n-1);
}

int factorialuser(int m) {
    return factorial(m) * factorial(m);
}
If a corresponding method were written in Haskell, there would be no second call to factorial when you call factorialuser. It might be possible to do this in C#, but I doubt the current compilers do it, even for an example as simple as this. As things get more complicated, it would be hard for C# compilers to optimize to the level Haskell can.
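In C# today you would have to hoist the repeated call yourself, along these lines:
int factorialuser(int m) {
    int f = factorial(m);   // hoist the common subexpression by hand
    return f * f;
}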
Note, F# is not really a "pure" functional language, currently. So, I brought in Haskell (which is great!).
Unfortunately, because F# is only mostly pure, there aren't really that many opportunities for aggressive optimization. In fact, there are some places where F# "pessimizes" code compared to C# (e.g. making defensive copies of structs to prevent observable mutation). On the bright side, the compiler does a good job overall despite this, providing comparable performance to C# in most places while simultaneously making programs easier to reason about.
I would say largely 'no'.
The main 'optimization' advantages you get from immutability or referential transparency are things like the ability to do 'common subexpression elimination' when you see code like ...f(x)...f(x).... But such analysis is hard to do without very precise information, and since F# runs on the .Net runtime and .Net has no way to mark methods as pure (effect-free), it requires a ton of built-in information and analysis to even try to do any of this.
On the other hand, in a language like Haskell (which mostly means 'Haskell', as there are few languages 'like Haskell' that anyone has heard of or uses :)) that is lazy and pure, the analysis is simpler (everything is pure, go nuts).
That said, such 'optimizations' can often interact badly with other useful aspects of the system (performance predictability, debugging, ...).
There are often stories of "a sufficiently smart compiler could do X", but my opinion is that the "sufficiently smart compiler" is, and always will be, a myth. If you want fast code, then write fast code; the compiler is not going to save you. If you want common subexpression elimination, then create a local variable (do it yourself).
This is mostly my opinion, and you're welcome to downvote or disagree (indeed I've heard 'multicore' suggested as a rising reason that potentially 'optimization may get sexy again', which sounds plausible on the face of it). But if you're ever hopeful about any compiler doing any non-trivial optimization (that is not supported by annotations in the source code), then be prepared to wait a long, long time for your hopes to be fulfilled.
Don't get me wrong - immutability is good, and is likely to help you write 'fast' code in many situations. But not because the compiler optimizes it - rather, because the code is easy to write, debug, get correct, parallelize, profile, and decide which are the most important bottlenecks to spend time on (possibly rewriting them mutably). If you want efficient code, use a development process that lets you develop, test, and profile quickly.
Additional optimizations for functional languages are sometimes possible, but not necessarily because of immutability. Internally, many compilers will convert code into an SSA (single static assignment) form, where each local variable inside a function can only be assigned once. This can be done for both imperative and functional languages. For instance:
x := x + 1
y := x + 4
can become
x_1 := x_0 + 1
y := x_1 + 4
where x_0 and x_1 are different variable names. This vastly simplifies many transformations, since you can move bits of code around without worrying about what value they have at specific points in the program. This doesn't work for values stored in memory though (i.e., globals, heap values, arrays, etc). Again, this is done for both functional and imperative languages.
One benefit most functional languages provide is a strong type system. This allows the compiler to make assumptions that it wouldn't be able to otherwise. For instance, if you have two references of different types, the compiler knows that they cannot alias (point to the same thing). This is not an assumption a C compiler could ever make.

C# - Default library has better performance?

Earlier today I made myself a lightweight memory stream, which basically writes to a byte array. I thought I'd benchmark the two of them to see if there's any difference - and there was:
(writing 1 byte to the array)
MemoryStream: 1.0001ms
mine: 3.0004ms
Everyone tells me that MemoryStream basically provides a byte array and a bunch of methods to work with it.
My question: does the default C# library have slightly better performance than the code we write? (Maybe it runs in release rather than debug?)
The .NET implementation is probably a bit better than your own, but also: how did you benchmark? A couple of million iterations, or just a few? Remember that you need a large test base so that you can eliminate noise (the CPU being called away for a moment, etc.) that would otherwise give false results.
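A rough sketch of the kind of benchmark loop I mean: warm up first so JIT compilation isn't counted, then time millions of iterations with a Stopwatch:
var ms = new System.IO.MemoryStream();
var sw = new System.Diagnostics.Stopwatch();

// Warm-up pass so the methods are already JIT-compiled before timing.
for (int i = 0; i < 1000; i++) ms.WriteByte(0);
ms.SetLength(0);

sw.Start();
for (int i = 0; i < 10000000; i++) ms.WriteByte(0);
sw.Stop();

System.Console.WriteLine("MemoryStream: {0} ms", sw.ElapsedMilliseconds);
// ...then repeat the same loop against your own stream and compare.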
The folks at Microsoft are much smarter than you and I and most likely have written a better optimized wrapper over Byte[], much better than something that you or I would implement.
If you are curious, I would suggest that you disassemble the types that you have recreated to see how exactly Microsoft has implemented them. In some of the more important areas of the framework (such as this I would imagine) you will find that the BCL calls out to unmanaged code to accomplish its goals.
Unmanaged code has a much better chance of outperforming managed code in cases like this since you can freely work with arrays without the overhead of a managed runtime (for things like bounds checking and such).
Many of the framework assemblies are NGENed, which may give them a small boost by bypassing the initial JIT time. This is unlikely to be the cause of a 2ms difference, especially if you'd already warmed up your methods before starting the stopwatch, but I mention it for completeness.
Also, yes, the framework assemblies are built in "release" mode (optimisations on and checks off), not "debug."
You probably used Array.Copy() instead of the faster Buffer.BlockCopy(). The fastest way is to use unsafe code with pointers. Check out how they do this in the Mono project (search for memcpy).
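For byte buffers the two calls look like this; both are real framework APIs, and the actual speed difference will vary by machine and array size:
byte[] src = new byte[4096];
byte[] dst = new byte[4096];

// Element-typed copy with full type checking:
System.Array.Copy(src, 0, dst, 0, src.Length);

// Raw byte-level copy of primitive arrays (count is in bytes); often faster for byte[]:
System.Buffer.BlockCopy(src, 0, dst, 0, src.Length);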
I'd wager that Microsoft's implementation is a wee bit better than yours. ;)
Did you check the source?
