I have a multithreaded program which consists of a C# interop layer over C++ code.
I am setting thread affinity (as in this post) and it works for part of my code, but for the second part it doesn't work. Can the Intel Compiler / IPP / MKL libs / inline assembly interfere with external affinity setting?
UPDATE:
I can't post code as it is a whole environment with many, many DLLs. I set the environment values OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, IPP_NUM_THREADS=1. When it runs in a single thread it runs OK, but when I use a number of C# threads and set affinity per thread (on a quad-core machine), the initialization runs fine on separate cores, but during processing all threads start using the same core. Hope I am clear enough.
Thanks.
We've had this exact problem; we'd set our thread affinity to what we wanted, and the IPP/MKL functions would blow that away! The answer to your question is 'yes'.
Auto Parallelism
The issue is that, by default, the Intel libraries like to automatically execute multi-threaded versions of the routines. So a single FFT gets computed by a number of threads set up by the library specifically for this purpose.
Intel's intent is that the programmer could get on with the job of writing a single threaded application, and the library would allow that single thread to benefit from a multicore processor by creating a number of threads for the maths work. A noble intent (your source code then need know nothing about the runtime hardware to extract the best achievable performance - handy sometimes), but a right bloody nuisance when one is doing one's own threading for one's own reasons.
Controlling the Library's Behaviour
Take a look at these Intel docs, section Support Functions / Threading Support Functions. You can either programmatically control the library's threading tendencies, or there are some environment variables you can set (like MKL_NUM_THREADS) before your program runs. Setting the number of threads was (as far as I recall) enough to stop the library doing its own thing.
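For illustration, here's a minimal sketch of doing it programmatically from the C# side via P/Invoke. The DLL name and whether you call MKL, legacy IPP, or both depend on your version and linkage, so treat the details as assumptions to verify against the docs above:

using System;
using System.Runtime.InteropServices;

static class IntelThreadingControl
{
    // Assumption: the single dynamic runtime library is named mkl_rt.dll;
    // other linkages export the same MKL_Set_Num_Threads entry point from a
    // differently named DLL.
    [DllImport("mkl_rt.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern void MKL_Set_Num_Threads(int numThreads);

    public static void ForceSingleThreadedMath()
    {
        // The environment variables only help if they are set before the
        // native libraries spin up their OpenMP thread pools.
        Environment.SetEnvironmentVariable("OMP_NUM_THREADS", "1");
        Environment.SetEnvironmentVariable("MKL_NUM_THREADS", "1");

        // Programmatic equivalent of MKL_NUM_THREADS=1. Legacy IPP versions
        // expose a similar ippSetNumThreads(1) call.
        MKL_Set_Num_Threads(1);
    }
}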
Philosophical Essay Inspired By Answering Your Question (best ignored)
More or less everything Intel is doing in CPU design and software (e.g. IPP/MKL) is aimed at making it unnecessary for the programmer to Worry About Threads. You want good math performance? Use MKL. You want that for loop to go fast? Turn on Auto Parallelisation in ICC. You want to make the best use of cache? That's what Hyperthreading is for.
It's not a bad approach, and personally speaking I think that they've done a pretty good job. AMD too. Their architectures are quite good at delivering good real world performance improvements to the "Average Programmer" for the minimal investment in learning, re-writing and code development.
Irritation
However, the thing that irritates me a little bit (though I don't want to appear ungrateful!) is that whilst this approach works for the majority of programmers out there (which is where the profitable market is), it just throws more obstacles in the way of those programmers who want to spin their own parallelism. I can't blame Intel for that of course, they've done exactly the right thing; they're a market led company, they need to make things that will sell.
By offering these easy features, the situation of there being too many under-skilled and under-trained programmers becomes even more entrenched. If all programmers can get good performance without having to learn what auto parallelism is actually doing, then we'll never move on. The pool of really good programmers who actually know that stuff will remain really small.
Problem
I see this as a problem (though only a small one, as I'll explain later). Computing needs to become more efficient for both economic and environmental reasons. Intel's approach allows for increased performance, and better silicon manufacturing techniques produce lower power consumption, but I always feel like it's not quite as efficient as it could be.
Example
Take the Cell processor at the heart of the PS3. It's something that I like to witter on about endlessly! However, IBM developed that with a completely different philosophy to Intel. They gave you no cache (just some fast static RAM instead to use as you saw fit), the architecture was pretty much pure NUMA, you had to do all your own parallelisation, etc. etc. The result was that if you really knew what you were doing you could get about 250 GFLOPS out of the thing (I think the non-PS3 variants went to 320 GFLOPS), for 80 watts, all the way back in 2005.
It's taken Intel chips about another 6 or 7 years for a single device to get to that level of performance. That's a lot of Moore's law growth. If the Cell were manufactured on Intel's latest silicon fab and given as many transistors as Intel puts into their big Xeons, it would still blow everything else away.
No Market
However, apart from the PS3, Cell was a non-starter as a market proposition. IBM decided that it would never be a big enough seller to be worth their while. There just weren't enough programmers out there who could really use it, and indulging the few of us who could made no commercial sense, which wouldn't please the shareholders.
Small Problem, Bigger Problem
I said earlier that this was only a small problem. Well, most of the world's computing isn't about high maths performance, it's become Facebook, Twitter, etc. That sort of workload is all about I/O performance, and for that you don't need high maths performance. So in that sense the dependence on Intel Doing Everything For You so that the average programmer can get good maths performance matters very little. There's just not enough maths being done to warrant a change in design philosophy.
In fact, I strongly suspect that the world will ultimately decide that you don't need a large chip at all, an ARM should do just fine. If that does come to pass then the market for Intel's very large chips with very good general-purpose maths compute performance will vanish. Effectively, those of us who want good maths performance are being heavily subsidised by those who want to fill enormous data centres with Intel-based hardware and put Intel PCs on every desktop.
We're simply lucky that Intel apparently has a desire to make sure that every big CPU they build is good at maths regardless of whether most of their users actually use that maths performance. I'm sure that desire has its foundations in marketing prowess and wanting the bragging rights, but those are not hard, commercially tangible artifacts that bring shareholder value.
So if those data centre guys decide that, actually, they'd rather save electricity and fill their data centres with ARMs, where does that leave Intel? ARMs are fine devices for the purpose for which they're intended, but they're not at the top of my list of Supercomputer chips. So where does that leave us?
Trend
My take on the current market trend is that 'Workstations' (PCs as we call them now) are going to start costing lots and lots of money, just like they did in the 1980s / early 90s.
I think that better supercomputers will become unaffordable because no one can spare the $10 billion it would take to do the next big chip. If people stop having PCs there won't be a mass market for large all-out GPUs, so we won't even be able to use those instead. Supercomputers are an exclusive thing, but they do play a vital role in our world and we do need them to get better. So who is going to pay for that? Not me, that's for sure.
Oops, that went on for quite a while...
Related
I'm currently attempting to create a Bitcoin Miner written in C# XNA.
https://github.com/Generalkidd/XNAMiner
Now the problem is that the actual number crunching of the Miner seems to be taking up too much CPU time, and therefore the UI of the program pretty much freezes at launch, although I do believe the calculations are still happening in the background despite the window being frozen and unresponsive. I tried implementing Aphid's ParallelTasks library and migrated some of the for-loops into a different thread. I didn't fully understand how these parallel for-loops worked, and thus the version I created did not iterate correctly; however, the program did speed up a lot. There are still a couple of for-loops left, as well as a bunch of foreach loops.
What's the easiest and most efficient way to optimize my code? Should I try moving each loop into its own thread? Or try moving entire methods into their own threads? Or would it be possible to use the GPU for these calculations (it'd ultimately be better that way given the state of CPU mining).
I hope you are doing this as an educational effort, as CPU/GPU mining has been obsolete since 2011. You can barely break even on the hardware investment, even with free electricity. ASICs are the new thing for mining now.
Different GPU/CPU hash rates
Mining Calculator
I see this term used a lot but I feel like most people use it out of laziness or ignorance. For instance, I was reading this article:
http://blogs.msdn.com/b/ricom/archive/2006/09/07/745085.aspx
where he talks about the decisions he makes to implement the types necessary for his app.
If it were me talking about these decisions for code that we need to write, other programmers would think either:
I am thinking way too much ahead when there is nothing and thus prematurely optimizing.
Over-thinking insignificant details when no slowdowns or performance problems have been experienced.
or both.
and would suggest to just implement it and not worry about these until they become a problem.
Which is preferable?
How do you differentiate between premature optimization and informed decision-making for a performance-critical application, before any implementation is done?
Optimization is premature if:
Your application isn't doing anything time-critical. (Which means, if you're writing a program that adds up 500 numbers in a file, the word "optimization" shouldn't even pop into your brain, since all it'll do is waste your time.)
You're doing something time-critical in something other than assembly, and still worrying whether i++; i++; is faster or i += 2... if it's really that critical, you'd be working in assembly and not wasting time worrying about this. (Even then, this particular example most likely won't matter.)
You have a hunch that one thing might be a bit faster than the other, but you need to look it up. For example, if something is bugging you about whether Stopwatch is faster than Environment.TickCount, it's premature optimization, since if the difference were bigger, you'd probably be more sure and wouldn't need to look it up.
If you have a guess that something might be slow but you're not too sure, just put a //NOTE: Performance? comment, and if you later run into bottlenecks, check such places in your code. I personally don't worry about optimizations that aren't too obvious; I just use a profiler later, if I need to.
Another technique:
I just run my program, randomly break into it with the debugger, and see where it stopped -- wherever it stops is likely a bottleneck, and the more often it stops there, the worse the bottleneck. It works almost like magic. :)
This proverb does not (I believe) refer to optimizations that are built into a good design as it is created. It refers to tasks specifically targeted at performance, which otherwise would not be undertaken.
This kind of optimization does not "become" premature, according to the common wisdom — it is guilty until proven innocent.
Optimisation is the process of making existing code run more efficiently (faster speed, and/or less resource usage).
All optimisation is premature if the programmer has not proven that it is necessary. (For example, by running the code to determine if it achieves the correct results in an acceptable timeframe. This could be as simple as running it to "see" if it runs fast enough, or running under a profiler to analyze it more carefully).
There are several stages to programming something well:
1) Design the solution and pick a good, efficient algorithm.
2) Implement the solution in a maintainable, well coded manner.
3) Test the solution and see if it meets your requirements on speed, RAM usage, etc. (e.g. "When the user clicks "Save", does it take less than 1 second?" If it takes 0.3s, you really don't need to spend a week optimising it to get that time down to 0.2s). A timing sketch follows this list.
4) IF it does not meet the requirements, consider why. In most cases this means go to step (1) to find a better algorithm now that you understand the problem better. (Writing a quick prototype is often a good way of exploring this cheaply)
5) IF it still does not meet the requirements, start considering optimisations that may help speed up the runtime (for example, look-up tables, caching, etc). To drive this process, profiling is usually an important tool to help you locate the bottlenecks and inefficiencies in the code, so you can make the greatest gain for the time you spend on the code.
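As a minimal illustration of the measurement in step (3), assuming a placeholder SaveDocument operation and a 1-second budget:

using System;
using System.Diagnostics;

class SaveTimingCheck
{
    static void SaveDocument() { /* placeholder for the real save */ }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        SaveDocument();
        sw.Stop();

        // Requirement from step (3): "Save" should take less than 1 second.
        if (sw.ElapsedMilliseconds < 1000)
            Console.WriteLine("OK: save took " + sw.ElapsedMilliseconds + " ms");
        else
            Console.WriteLine("Too slow (" + sw.ElapsedMilliseconds + " ms) - back to step (1)");
    }
}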
I should point out that an experienced programmer working on a reasonably familiar problem may be able to jump through the first steps mentally and then just apply a pattern, rather than physically going through this process every time, but this is simply a shortcut gained through experience.
Thus, there are many "optimisations" that experienced programmers will build into their code automatically. These are not "premature optimisations" so much as "common-sense efficiency patterns". These patterns are quick and easy to implement, but vastly improve the efficiency of the code, and you don't need to do any special timing tests to work out whether or not they will be of benefit:
Not putting unnecessary code into loops. (Similar to the optimisation of removing unnecessary code from existing loops, but it doesn't involve writing the code twice!)
Storing intermediate results in variables rather than re-calculating things over and over.
Using look-up tables to provide precomputed values rather than calculating them on the fly (a sketch of this follows the list).
Using appropriate-sized data structures (e.g. storing a percentage in a byte (8 bits) rather than a long (64 bits) will use 8 times less RAM)
Drawing a complex window background using a pre-drawn image rather than drawing lots of individual components
Applying compression to packets of data you intend to send over a low-speed connection to minimise the bandwidth usage.
Drawing images for your web page in a style that allows you to use a format that will get high quality and good compression.
And of course, although it's not technically an "optimisation", choosing the right algorithm in the first place!
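To make a couple of those patterns concrete, here is a tiny sketch (all names invented) combining a look-up table with keeping unnecessary work out of the loop:

using System;

static class EfficiencyPatterns
{
    // Look-up table: squares of the percentages 0..100, computed once up front.
    static readonly double[] SquareOfPercent = BuildTable();

    static double[] BuildTable()
    {
        var table = new double[101];
        for (int i = 0; i <= 100; i++)
            table[i] = (i / 100.0) * (i / 100.0);
        return table;
    }

    // Percentages fit comfortably in a byte (see the data-structure point above).
    public static double SumOfSquares(byte[] percentages)
    {
        double total = 0;
        foreach (byte p in percentages)
            total += SquareOfPercent[p];   // table lookup instead of recomputation
        return total;
    }
}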
For example, I just replaced an old piece of code in our project. My new code is not "optimised" in any way, but (unlike the original implementation) it was written with efficiency in mind. The result: Mine runs 25 times faster - simply by not being wasteful. Could I optimise it to make it faster? Yes, I could easily get another 2x speedup. Will I optimise my code to make it faster? No - a 5x speed improvement would have been sufficient, and I have already achieved 25x. Further work at this point would just be a waste of precious programming time. (But I can revisit the code in future if the requirements change)
Finally, one last point: The area you are working in dictates the bar you must meet. If you are writing a graphics engine for a game or code for a real-time embedded controller, you may well find yourself doing a lot of optimisation. If you are writing a desktop application like a notepad, you may never need to optimise anything as long as you aren't overly wasteful.
When starting out, just delivering a product is more important than optimizing.
Over time you are going to profile various applications and will learn coding skills that will naturally lead to optimized code. Basically at some point you'll be able to spot potential trouble spots and build things accordingly.
However don't sweat it until you've found an actual problem.
Premature optimization is making an optimization for performance at the cost of some other positive attribute of your code (e.g. readability) before you know that it is necessary to make this tradeoff.
Usually premature optimizations are made during the development process without using any profiling tools to find bottlenecks in the code. In many cases the optimization will make the code harder to maintain, and sometimes it also increases the development time, and therefore the cost of the software. Worse... some premature optimizations turn out not to make the code any faster at all, and in some cases can even make the code slower than it was before.
When you have less than 10 years of coding experience.
Having (lots of) experience might be a trap. I know many very experienced programmers (C/C++, assembly) who tend to worry too much because they are used to worrying about clock ticks and superfluous bits.
There are areas such as embedded or realtime systems where these do count, but in regular OLTP/LOB apps most of your effort should be directed towards maintainability, readability and changeability.
Optimization is tricky. Consider the following examples:
Deciding on implementing two servers, each doing its own job, instead of implementing a single server that will do both jobs.
Deciding to go with one DBMS and not another, for performance reasons.
Deciding to use a specific, non-portable API when there is a standard (e.g., using Hibernate-specific functionality when you basically need the standard JPA), for performance reasons.
Coding something in assembly for performance reasons.
Unrolling loops for performance reasons.
Writing a very fast but obscure piece of code.
My bottom line here is simple. Optimization is a broad term. When people talk about premature optimization, they don't mean you need to just do the first thing that comes to mind without considering the complete picture. They are saying you should:
Concentrate on the 80/20 rule - don't consider ALL the possible cases, but the most probable ones.
Don't over-design stuff without any good reason.
Don't write code that is not clear, simple and easily maintainable if there is no real, immediate performance problem with it.
It really all boils down to your experience. If you are an expert in image processing, and someone requests you do something you did ten times before, you will probably push all your known optimizations right from the beginning, but that would be ok. Premature optimization is when you're trying to optimize something when you don't know it needs optimization to begin with. The reason for that is simple - it's risky, it's wasting your time, and it will be less maintainable. So unless you're experienced and you've been down that road before, don't optimize if you don't know there's a problem.
Note that optimization is not free (as in beer)
it takes more time to write
it takes more time to read
it takes more time to test
it takes more time to debug
...
So before optimizing anything, you should be sure it's worth it.
That Point3D type you linked to seems like the cornerstone of something, and the case for optimization was probably obvious.
Just like the creators of the .NET library didn't need any measurements before they started optimizing System.String. They would have to measure during though.
But most code does not play a significant role in the performance of the end product. And that means any effort in optimization is wasted.
Besides all that, most 'premature optimizations' are untested/unmeasured hacks.
Optimizations are premature if you spend too much time designing those during the earlier phases of implementation. During the early stages, you have better things to worry about: getting core code implemented, unit tests written, systems talking to each other, UI, and whatever else. Optimizing comes with a price, and you might well be wasting time on optimizing something that doesn't need to be, all the while creating code that is harder to maintain.
Optimizations only make sense when you have concrete performance requirements for your project, and then performance will matter after the initial development and you have enough of your system implemented in order to actually measure whatever it is you need to measure. Never optimize without measuring.
As you gain more experience, you can make your early designs and implementations with a small eye towards future optimizations, that is, try to design in such a way that will make it easier to measure performance and optimize later on, should that even be necessary. But even in this case, you should spend little time on optimizations in the early phases of development.
I'm writing a book on multicore programming using .NET 4 and I'm curious to know what parts of multicore programming people have found difficult to grok or anticipate being difficult to grok?
What's a useful unit of work to parallelize, and how do I find/organize one?
All these parallelism primitives aren't helpful if you fork a piece of work that is smaller than the forking overhead; in fact, that buys you a nice slowdown instead of what you are expecting.
So one of the big problems is finding units of work that are obviously more expensive than the parallelism primitives. A key problem here is that nobody knows what anything costs to execute, including the parallelism primitives themselves. Clearly calibrating these costs would be very helpful. (As an aside, we designed, implemented, and daily use a parallel programming language, PARLANSE, whose objective was to minimize the cost of the parallelism primitives by allowing the compiler to generate and optimize them, with the goal of making smaller bits of work "more parallelizable").
One might also consider discussing big-O notation and its applications. We all hope that the parallelism primitives have cost O(1). If that's the case, then if you find work with cost O(x) > O(1) then that work is a good candidate for parallelization. If your proposed work is also O(1), then whether it is effective or not depends on the constant factors and we are back to calibration as above.
There's the problem of collecting work into large enough units, if none of the pieces are large enough. Code motion, algorithm replacement, ... are all useful ideas to achieve this effect.
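A rough .NET 4 sketch of that "collect work into bigger units" idea, using Partitioner.Create to hand each worker a contiguous range instead of forking per element (the work itself is invented):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ChunkingExample
{
    static void Main()
    {
        var data = new double[1000000];

        // Naive: one tiny work item per element; the forking overhead can
        // easily exceed the cost of the work itself.
        Parallel.For(0, data.Length, i => { data[i] = Math.Sqrt(i); });

        // Chunked: each worker receives a contiguous range, so the parallelism
        // overhead is paid once per chunk rather than once per element.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(i);
        });
    }
}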
Lastly, there's the problem of synchronization: when do my parallel units have to interact, what primitives should I use, and how much do those primitives cost? (More than you expect!)
I guess some of it depends on how basic or advanced the book/audience is. When you go from single-threaded to multi-threaded programming for the first time, you typically fall off a huge cliff (and many never recover, see e.g. all the muddled questions about Control.Invoke).
Anyway, to add some thoughts that are less about the programming itself, and more about the other related tasks in the software process:
Measuring: deciding what metric you are aiming to improve, measuring it correctly (it is so easy to accidentally measure the wrong thing), using the right tools, differentiating signal versus noise, interpreting the results and understanding why they are as they are.
Testing: how to write tests that tolerate unimportant non-determinism/interleavings, but still pin down correct program behavior.
Debugging: tools, strategies, when "hard to debug" implies feedback to improve your code/design and better partition mutable state, etc.
Physical versus logical thread affinity: understanding the GUI thread, understanding how e.g. an F# MailboxProcessor/agent can encapsulate mutable state and run on multiple threads but always with only a single logical thread (one program counter).
Patterns (and when they apply): fork-join, map-reduce, producer-consumer, ...
I expect that there will be a large audience for e.g. "help, I've got a single-threaded app with 12% CPU utilization, and I want to learn just enough to make it go 4x faster without much work" and a smaller audience for e.g. "my app is scaling sub-linearly as we add cores because there seems to be contention here, is there a better approach to use?", and so a bit of the challenge may be serving each of those audiences.
Since you are writing a whole book on multi-core programming in .NET, I think you can also go beyond multi-core a little bit.
For example, you could include a chapter on parallel computing in a distributed system in .NET. Unfortunately, there are no mature frameworks in .NET yet; DryadLinq is the closest. (On the other side, Hadoop and its friends on the Java platform are really good.)
You can also use a chapter demonstrating some GPU computing stuff.
One thing that has tripped me up is which approach to use to solve a particular type of problem. There are agents, tasks, async computations, MPI for distribution - for many problems you could use multiple methods, but I'm having difficulty understanding why I should use one over another.
To understand: low level memory details like the difference between acquire and release semantics of memory.
Most of the rest of the concepts and ideas (anything can interleave, race conditions, ...) are not that difficult with a little usage.
Of course the practice, especially if something is failing sometimes, is very hard as you need to work at multiple levels of abstraction to understand what is going on, so keep your design simple and as far as possible design out the need for locking etc. (e.g. using immutable data and higher level abstractions).
It's not so much the theoretical details, but more the practical implementation details, which trip people up.
What's the deal with immutable data structures?
All the time, people try to update a data structure from multiple threads, find it too hard, and someone chimes in "use immutable data structures!", and so our persistent coder writes this:
ImmutableSet set;

void ThreadLoop1()
{
    foreach (Customer c in dataStore1)
        set = set.Add(ProcessCustomer(c));   // read-modify-write race on the shared 'set'
}

void ThreadLoop2()
{
    foreach (Customer c in dataStore2)
        set = set.Add(ProcessCustomer(c));   // updates from the other loop can be lost
}
The coder has heard all their life that immutable data structures can be updated without locking, but the new code doesn't work, for obvious reasons.
Even if you're targeting academics and experienced devs, a little primer on the basics of immutable programming idioms can't hurt.
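For completeness, one correct idiom is to publish each new version with a compare-and-swap loop, so a concurrent update simply forces a retry. A sketch assuming the System.Collections.Immutable package (ImmutableHashSet), which arrived later than this discussion:

using System.Collections.Immutable;
using System.Threading;

class Customer { }

class CustomerAggregator
{
    ImmutableHashSet<Customer> set = ImmutableHashSet<Customer>.Empty;

    public void AddProcessed(Customer c)
    {
        Customer processed = ProcessCustomer(c);

        ImmutableHashSet<Customer> current, updated;
        do
        {
            current = set;                    // snapshot the current version
            updated = current.Add(processed); // build the new immutable version
        }
        // Publish only if no other thread published in the meantime; otherwise retry.
        while (Interlocked.CompareExchange(ref set, updated, current) != current);
    }

    static Customer ProcessCustomer(Customer c) { return c; }
}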
How to partition roughly equal amounts of work between threads?
Getting this step right is hard. Sometimes you break up a single process into 10,000 steps which can be executed in parallel, but not all steps take the same amount of time. If you split the work on 4 threads, and the first 3 threads finish in 1 second, and the last thread takes 60 seconds, your multithreaded program isn't much better than the single-threaded version, right?
So how do you partition problems with roughly equal amounts of work between all threads? Lots of good heuristics on solving bin packing problems should be relevant here.
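One common mitigation is dynamic scheduling: split the job into many more steps than threads and let each thread pull the next unclaimed step from a shared counter, so a thread that hits slow steps simply ends up claiming fewer of them. A rough sketch (step count and work are invented):

using System;
using System.Threading;

class DynamicPartitioning
{
    static void DoStep(int step) { /* uneven amount of work per step */ }

    static void Main()
    {
        const int totalSteps = 10000;            // far more steps than threads
        int nextStep = -1;
        var workers = new Thread[Environment.ProcessorCount];

        for (int t = 0; t < workers.Length; t++)
        {
            workers[t] = new Thread(() =>
            {
                int step;
                // Each thread claims the next unclaimed step until none remain.
                while ((step = Interlocked.Increment(ref nextStep)) < totalSteps)
                    DoStep(step);
            });
            workers[t].Start();
        }

        foreach (Thread w in workers)
            w.Join();
    }
}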
How many threads?
If your problem is nicely parallelizable, adding more threads should make it faster, right? Well not really, lots of things to consider here:
Even on a single-core processor, adding more threads can make a program faster, because more threads give the OS more opportunities to schedule your work, so your program gets more execution time than the single-threaded version would. But the law of diminishing returns applies: adding more threads increases context switching, so at a certain point, even if your program gets the most execution time, performance can still be worse than the single-threaded version.
So how do you spin off just enough threads to minimize execution time?
And if there are lots of other apps spinning up threads and competing for resources, how do you detect performance changes and adjust your program automagically?
I find the conceptions of synchronized data moving across worker nodes in complex patterns very hard to visualize and program.
Usually I find debugging to be a bear, also.
It seems like optimization is a lost art these days. Wasn't there a time when all programmers squeezed every ounce of efficiency from their code? Often doing so while walking five miles in the snow?
In the spirit of bringing back a lost art, what are some tips that you know of for simple (or perhaps complex) changes to optimize C#/.NET code? Since it's such a broad thing that depends on what one is trying to accomplish it'd help to provide context with your tip. For instance:
When concatenating many strings together use StringBuilder instead. See link at the bottom for caveats on this.
Use string.Compare to compare two strings instead of doing something like string1.ToLower() == string2.ToLower()
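For reference, those two tips in code form (a minimal sketch):

using System;
using System.Text;

class StringTips
{
    // Tip 1: StringBuilder avoids allocating a new intermediate string per concatenation.
    public static string Concatenate(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
            sb.Append(part);
        return sb.ToString();
    }

    // Tip 2: no throw-away lowercased copies, and the comparison rules are explicit.
    public static bool SameIgnoringCase(string a, string b)
    {
        return string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0;
    }
}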
The general consensus so far seems to be measuring is key. This kind of misses the point: measuring doesn't tell you what's wrong, or what to do about it if you run into a bottleneck. I ran into the string concatenation bottleneck once and had no idea what to do about it, so these tips are useful.
My point in even posting this is to have a place for common bottlenecks and how they can be avoided before even running into them. It's not necessarily about plug-and-play code that anyone should blindly follow, but more about gaining an understanding that performance should be thought about, at least somewhat, and that there are some common pitfalls to look out for.
I can see though that it might be useful to also know why a tip is useful and where it should be applied. For the StringBuilder tip I found the help I needed long ago here on Jon Skeet's site.
It seems like optimization is a lost art these days.
There was once a day when the manufacture of, say, microscopes was practiced as an art. The optical principles were poorly understood. There was no standardization of parts. The tubes and gears and lenses had to be made by hand, by highly skilled workers.
These days microscopes are produced as an engineering discipline. The underlying principles of physics are extremely well understood, off-the-shelf parts are widely available, and microscope-building engineers can make informed choices as to how to best optimize their instrument to the tasks it is designed to perform.
That performance analysis is a "lost art" is a very, very good thing, because that art was practiced as an art. Optimization should be approached for what it is: an engineering problem solvable through careful application of solid engineering principles.
I have been asked dozens of times over the years for my list of "tips and tricks" that people can use to optimize their vbscript / their jscript / their active server pages / their VB / their C# code. I always resist this. Emphasizing "tips and tricks" is exactly the wrong way to approach performance. That way leads to code which is hard to understand, hard to reason about, hard to maintain, and that is typically not noticeably faster than the corresponding straightforward code.
The right way to approach performance is to approach it as an engineering problem like any other problem:
Set meaningful, measurable, customer-focused goals.
Build test suites to test your performance against these goals under realistic but controlled and repeatable conditions.
If those suites show that you are not meeting your goals, use tools such as profilers to figure out why.
Optimize the heck out of what the profiler identifies as the worst-performing subsystem. Keep profiling on every change so that you clearly understand the performance impact of each.
Repeat until one of three things happens (1) you meet your goals and ship the software, (2) you revise your goals downwards to something you can achieve, or (3) your project is cancelled because you could not meet your goals.
This is the same as you'd solve any other engineering problem, like adding a feature -- set customer focused goals for the feature, track progress on making a solid implementation, fix problems as you find them through careful debugging analysis, keep iterating until you ship or fail. Performance is a feature.
Performance analysis on complex modern systems requires discipline and focus on solid engineering principles, not on a bag full of tricks that are narrowly applicable to trivial or unrealistic situations. I have never once solved a real-world performance problem through application of tips and tricks.
Get a good profiler.
Don't bother even trying to optimize C# (really, any code) without a good profiler. It actually helps dramatically to have both a sampling and a tracing profiler on hand.
Without a good profiler, you're likely to create false optimizations, and, most importantly, optimize routines that aren't a performance problem in the first place.
The first three steps to profiling should always be 1) Measure, 2) measure, and then 3) measure....
Optimization guidelines:
Don't do it unless you need to
Don't do it if it's cheaper to throw new hardware at the problem instead of a developer
Don't do it unless you can measure the changes in a production-equivalent environment
Don't do it unless you know how to use a CPU and a Memory profiler
Don't do it if it's going to make your code unreadable or unmaintainable
As processors continue to get faster the main bottleneck in most applications isn't CPU, it's bandwidth: bandwidth to off-chip memory, bandwidth to disk and bandwidth to net.
Start at the far end: use YSlow to see why your web site is slow for end-users, then move back and fix your database accesses to be not too wide (columns) and not too deep (rows).
In the very rare cases where it's worth doing anything to optimize CPU usage be careful that you aren't negatively impacting memory usage: I've seen 'optimizations' where developers have tried to use memory to cache results to save CPU cycles. The net effect was to reduce the available memory to cache pages and database results which made the application run far slower! (See rule about measuring.)
I've also seen cases where a 'dumb' un-optimized algorithm has beaten a 'clever' optimized algorithm. Never underestimate how good compiler-writers and chip-designers have become at turning 'inefficient' looping code into super-efficient code that can run entirely in on-chip memory with pipelining. Your 'clever' tree-based algorithm with an unrolled inner loop counting backwards that you thought was 'efficient' can be beaten simply because it failed to stay in on-chip memory during execution. (See rule about measuring.)
When working with ORMs be aware of N+1 Selects.
List<Order> _orders = _repository.GetOrders(DateTime.Now);

foreach (var order in _orders)
{
    Print(order.Customer.Name);
}
If the customers are not eagerly loaded this could result in several round trips to the database.
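A sketch of the eager-loading alternative, assuming an Entity Framework-style context (other ORMs have equivalents, e.g. NHibernate's Fetch); the context and Print here are placeholders mirroring the fragment above:

// One query that brings the customers along, instead of one SELECT per order.
List<Order> _orders = context.Orders
    .Include("Customer")                      // eager-load the related Customer
    .Where(o => o.Date >= DateTime.Now.Date)
    .ToList();

foreach (var order in _orders)
{
    Print(order.Customer.Name);               // no extra round trip per iteration
}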
Don't use magic numbers, use enumerations
Don't hard-code values
Use generics where possible since they're type-safe and avoid boxing and unboxing (see the sketch after this list)
Use an error handler where it's absolutely needed
Dispose, dispose, dispose. The CLR won't know how to close your database connections, so close them after use and dispose of unmanaged resources
Use common-sense!
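A small illustration of the boxing point above (values invented):

using System;
using System.Collections;
using System.Collections.Generic;

class BoxingExample
{
    static void Main()
    {
        // Non-generic: every int is boxed on Add and unboxed (with a cast) on read.
        ArrayList untyped = new ArrayList();
        untyped.Add(42);
        int a = (int)untyped[0];

        // Generic: the ints are stored directly - type-safe, no boxing.
        List<int> typed = new List<int>();
        typed.Add(42);
        int b = typed[0];

        Console.WriteLine(a + b);
    }
}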
OK, I have got to throw in my favorite: If the task is long enough for human interaction, use a manual break in the debugger.
Vs. a profiler, this gives you a call stack and variable values you can use to really understand what's going on.
Do this 10-20 times and you get a good idea of what optimization might really make a difference.
If you identify a method as a bottleneck, but you don't know what to do about it, you are essentially stuck.
So I'll list a few things. All of these things are not silver bullets and you will still have to profile your code. I'm just making suggestions for things you could do and can sometimes help. Especially the first three are important.
Try solving the problem using just (or: mainly) low-level types or arrays of them.
Problems are often small - using a smart but complex algorithm does not always make you win, especially if the less-smart algorithm can be expressed in code that only uses (arrays of) low level types. Take for example InsertionSort vs MergeSort for n<=100 or Tarjan's Dominator finding algorithm vs using bitvectors to naively solve the data-flow form of the problem for n<=100. (the 100 is of course just to give you some idea - profile!)
Consider writing a special case that can be solved using just low-level types (often problem instances of size < 64), even if you have to keep the other code around for larger problem instances.
Learn bitwise arithmetic to help you with the two ideas above.
BitArray can be your friend, compared to Dictionary, or worse, List. But beware that the implementation is not optimal; You can write a faster version yourself. Instead of testing that your arguments are out of range etc., you can often structure your algorithm so that the index can not go out of range anyway - but you can not remove the check from the standard BitArray and it is not free.
As an example of what you can do with just arrays of low-level types, the BitMatrix is a rather powerful structure that can be implemented as just an array of ulongs, and you can even traverse it using a ulong as a "front", because you can take the lowest-order bit in constant time (compared with the Queue in Breadth First Search - but obviously the order is different and depends on the index of the items rather than purely the order in which you find them).
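A tiny sketch of that idea for a single 64-bit word (one BitMatrix row, say); isolating and clearing the lowest set bit are the constant-time tricks referred to above, while turning the isolated bit into an index is done naively here:

using System;

static class UlongBitSet
{
    // A 64-element set packed into one ulong; larger sets are just arrays of these.
    public static ulong Add(ulong set, int index)
    {
        return set | (1UL << index);
    }

    // Visit every member, lowest index first, by repeatedly isolating and then
    // clearing the lowest set bit (both constant-time bit tricks).
    public static void ForEachSetBit(ulong set, Action<int> visit)
    {
        while (set != 0)
        {
            ulong lowest = set & (~set + 1);   // isolate the lowest set bit
            visit(BitIndex(lowest));
            set &= set - 1;                    // clear the lowest set bit
        }
    }

    static int BitIndex(ulong isolatedBit)
    {
        int index = 0;
        while ((isolatedBit >>= 1) != 0)       // simple scan; not the clever part
            index++;
        return index;
    }
}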
Division and modulo are really slow unless the right hand side is a constant.
Floating point math is not in general slower than integer math anymore (not "something you can do", but "something you can skip doing")
Branching is not free. If you can avoid it using a simple arithmetic (anything but division or modulo) you can sometimes gain some performance. Moving a branch to outside a loop is almost always a good idea.
People have funny ideas about what actually matters. Stack Overflow is full of questions about, for example, is ++i more "performant" than i++. Here's an example of real performance tuning, and it's basically the same procedure for any language. If code is simply written a certain way "because it's faster", that's guessing.
Sure, you don't purposely write stupid code, but if guessing worked, there would be no need for profilers and profiling techniques.
The truth is that there is no such thing as the perfect optimised code. You can, however, optimise for a specific portion of code, on a known system (or set of systems) on a known CPU type (and count), a known platform (Microsoft? Mono?), a known framework / BCL version, a known CLI version, a known compiler version (bugs, specification changes, tweaks), a known amount of total and available memory, a known assembly origin (GAC? disk? remote?), with known background system activity from other processes.
In the real world, use a profiler, and look at the important bits; usually the obvious things are anything involving I/O, anything involving threading (again, this changes hugely between versions), and anything involving loops and lookups, but you might be surprised at what "obviously bad" code isn't actually a problem, and what "obviously good" code is a huge culprit.
Tell the compiler what to do, not how to do it. As an example, foreach (var item in list) is better than for (int i = 0; i < list.Count; i++), and m = list.Max(i => i.value); is better than sorting with list.Sort((a, b) => a.value.CompareTo(b.value)); and then taking m = list[list.Count - 1].value;.
By telling the system what you want to do it can figure out the best way to do it. LINQ is good because its results aren't computed until you need them. If you only ever use the first result, it doesn't have to compute the rest.
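A tiny illustration of that deferred-execution point (values invented):

using System;
using System.Linq;

class DeferredExecution
{
    static void Main()
    {
        int[] values = { 9, 2, 7, 4, 1 };

        // At this point the query is only a description of work - nothing has run.
        var expensive = values.Select(v =>
        {
            Console.WriteLine("processing " + v);
            return v * v;
        });

        // First() pulls a single element through the pipeline, so only
        // "processing 9" is printed; the remaining work never happens.
        Console.WriteLine(expensive.First());
    }
}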
Ultimately (and this applies to all programming) minimize loops and minimize what you do in loops. Even more important is to minimize the number of loops inside your loops. What's the difference between an O(n) algorithm and an O(n^2) algorithm? The O(n^2) algorithm has a loop inside of a loop.
I don't really try to optimize my code, but at times I will go through and use something like Reflector to decompile my programs back to source. It is interesting to then compare what I wrote with what Reflector outputs. Sometimes I find that what I did in a more complicated form could have been simplified. It may not optimize things, but it helps me to see simpler solutions to problems.
Can anyone recommend a low-overhead profiler on Windows? Something similar to Linux's OProfile or OS X's Shark.
must be able to sample non-instrumented binaries
capable of resolving CLR stacks
preferably delayed PDB resolution of symbols
impact low enough to be able to get a decent reading on live, production systems
The Visual Studio Team Suite profiler is amazing. It's so good at its job that it makes me seem better at mine.
Redgate has a performance profiler and memory profiler which I haven't used.
Automated QA's AQTime has saved my butt. I used it to figure out a problem with a .NET web service calling some nasty old C code, and it did it well.
This is what I use. Although it is not suitable for live production use, it answers your other needs.
For live production use, you need something that samples the stack. In my opinion, it's OK if it has some small overhead. My goal is to discover the activities that need optimization, and for that I'm willing to pay a temporary price in speed.
There is always one or more intervals of interest, like the interval between when a request is received, and the response goes out. It's surprising how few samples you need in such an interval to find out what's taking the time.
High precision of timing is not needed. If there is something X going on that, through optimization, would save you, say, 50% of the interval, that is roughly the fraction of samples that will show you X.