Leveraging 64 bit architecture effectively when porting applications - c#

I'm considering porting a processor intensive 32 bit .NET 4 (C#) console application to 64 bit. It is a parallelized number crunching algorithm that runs for about five hours.
I know a bit about the performance benefits, but I'm not sure how to change the code to take full advantage of the 64-bit architecture. For example, should I consider changing all Int32 types to Int64?
Where could I find some comprehensive resources to learn about optimization considerations when porting to x64? Searching around has come up mostly with general information about the x64 architecture.

There's little hope that your hard work will pay off. An x64 core ticks at the same pace as an x86 core; clock speeds are the same and are constrained by the physics of silicon. One clear advantage of an x64 core is that it can move and operate on 64 bits at a time, double what an x86 core handles. But in practice few real-world problems actually need that range. Most integer problems work just fine with a range of +/- two billion. It is easy to tell from your code: if you have a lot of long instead of int variables in your innermost loops, then you'll be ahead.
A significant disadvantage of x64 code is that it consumes the available cache a lot quicker. Cache size is also constrained by silicon; there's only so much memory that fits on a chip. Any pointer is double the size, 8 bytes instead of 4 bytes in 32-bit mode, and pointers are used for every object reference. Cache is a very big deal on modern cores because the memory bus is glacially slow compared to the speed of the core. This disadvantage is balanced somewhat by x64 having 8 more CPU registers (r8 through r15).
The net effect is that 32-bit code usually executes a bit faster than 64-bit code. You only get a benefit from 64-bit code when your program is bogged down by having to cram the data it processes in a 2 gigabyte address space. In other words, having to use files or memory-mapped files to avoid running out of memory. If your app is compute constrained, as suggested in your question, it is very unlikely you'll be ahead.
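One way to check this claim for a particular workload is to time the same long-heavy loop in an x86 build and in an x64 build. The sketch below is only an illustration; `WidthDemo` and `SumLong` are hypothetical names, and the loop is a stand-in for a real inner loop, not the asker's algorithm.

```csharp
using System;
using System.Diagnostics;

public static class WidthDemo
{
    // Adds up 0..count-1 using 64-bit arithmetic for both the counter and the total.
    public static long SumLong(long count)
    {
        long total = 0;
        for (long i = 0; i < count; i++)
        {
            total += i;
        }
        return total;
    }

    public static void Main()
    {
        Stopwatch sw = Stopwatch.StartNew();
        long result = SumLong(200000000L);
        sw.Stop();

        // Build once with platform target x86 and once with x64, then compare the timings.
        Console.WriteLine("pointer size " + IntPtr.Size + " bytes, result " + result
            + ", " + sw.ElapsedMilliseconds + " ms");
    }
}
```

If the two builds show no meaningful difference for loops like the ones in your own code, the port is unlikely to help a compute-bound job.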

For pure .NET code, it seems there is little you can optimize yourself.
Compile for Any CPU and the JIT compiler will generate x64 code that takes advantage of things like the larger address space and the additional registers.
See How can compiling my application for 64-bit make it faster or better? for a similar discussion.
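To confirm which mode an Any CPU build actually runs in, the process can report it at startup. This is a minimal sketch; `BitnessReport` is a hypothetical name, while `Environment.Is64BitProcess` and `Environment.Is64BitOperatingSystem` are available from .NET 4.

```csharp
using System;

public static class BitnessReport
{
    // Summarizes whether the OS and the current process are 64-bit.
    public static string Describe()
    {
        return string.Format(
            "OS 64-bit: {0}, process 64-bit: {1}, pointer size: {2} bytes",
            Environment.Is64BitOperatingSystem,
            Environment.Is64BitProcess,
            IntPtr.Size);
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Note that on newer toolchains an Any CPU executable built with the "Prefer 32-bit" option still runs as a 32-bit process, so checking at run time is safer than assuming.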

You did not mention whether your application is also memory intensive. However, be aware that when making this transition, your application's memory footprint is expected to grow significantly, since every reference now takes twice the space. In other words, although you have an effectively unlimited address space, effective memory utilization is still limited by RAM.
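The growth in reference size can be observed directly by allocating an array of references and asking the GC how much the managed heap grew. The sketch below is a rough measurement, not a precise one; `ReferenceCost` and `BytesForReferenceArray` are hypothetical names, and `GC.GetTotalMemory` only returns an approximation.

```csharp
using System;

public static class ReferenceCost
{
    // Returns the approximate managed heap growth caused by one object[] of the given length.
    public static long BytesForReferenceArray(int length)
    {
        long before = GC.GetTotalMemory(true);   // force a collection first for a cleaner baseline
        object[] references = new object[length];
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(references);                // keep the array alive until after the measurement
        return after - before;
    }

    public static void Main()
    {
        const int length = 1000000;
        long bytes = BytesForReferenceArray(length);
        Console.WriteLine("approx. {0:F1} bytes per reference (pointer size {1})",
            (double)bytes / length, IntPtr.Size);
    }
}
```

Run as x86 and as x64, the per-reference figure should land near 4 and 8 bytes respectively, give or take array overhead and GC noise.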

Changing 32-bit types to 64-bit types doesn't help in itself. On x64, 32-bit operations are all as fast as or faster than 64-bit operations. What's different is that 64-bit operations are faster in 64-bit mode than they are in 32-bit mode. So, for example, if you use an array of int32s and could instead use an array of int64s that's half as long, it's likely to help.
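A case where the "half as long" idea applies naturally is bitwise work on packed bit sets: the same bits can be stored in 32-bit or 64-bit words, and the 64-bit version needs half as many loop iterations. The following is a sketch under that assumption; `WideWords` and `XorInto` are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class WideWords
{
    // XORs source into target, 64 bits per iteration.
    public static void XorInto(ulong[] target, ulong[] source)
    {
        for (int i = 0; i < target.Length; i++)
        {
            target[i] ^= source[i];
        }
    }

    // Same operation with 32-bit words: twice as many iterations for the same number of bits.
    public static void XorInto(uint[] target, uint[] source)
    {
        for (int i = 0; i < target.Length; i++)
        {
            target[i] ^= source[i];
        }
    }

    public static void Main()
    {
        const int bits = 1 << 28;                       // 32 MB of bits in either layout
        uint[] narrowA = new uint[bits / 32], narrowB = new uint[bits / 32];
        ulong[] wideA = new ulong[bits / 64], wideB = new ulong[bits / 64];

        Stopwatch sw = Stopwatch.StartNew();
        for (int pass = 0; pass < 10; pass++) XorInto(narrowA, narrowB);
        Console.WriteLine("32-bit words: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int pass = 0; pass < 10; pass++) XorInto(wideA, wideB);
        Console.WriteLine("64-bit words: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

With arrays this large the loop may end up limited by memory bandwidth rather than by instruction count, so the gap can be smaller than 2x; smaller, cache-resident arrays show the effect more clearly.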

Related

Benchmarking RAM performance - UWP and C#

I'm developing a benchmarking application using Universal Windows Platform that evaluates CPU and RAM performance of a Windows 10 system.
Although I found different algorithms to benchmark a CPU, I still haven't found any solid algorithm or solution to evaluate the read and write speeds of memory.
How can I achieve this in C#?
Thanks in advance :)
I don't see why this would not be possible from managed code. Array access code turns into normal x86 memory instructions. It's a thin abstraction. In particular I don't see why you would need a customized OS.
You should be able to test sequential memory speed by performing memcpy on big arrays. They must be bigger than the last level cache size.
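A minimal sketch of the sequential test, assuming `Buffer.BlockCopy` as the managed stand-in for memcpy. `SequentialBandwidth` and `MeasureCopyGBPerSecond` are hypothetical names, and the 256 MB buffer size is an assumption chosen to be well above typical last-level cache sizes.

```csharp
using System;
using System.Diagnostics;

public static class SequentialBandwidth
{
    // Copies a buffer of the given size repeatedly and returns GB/s (reads plus writes).
    // megabytes must stay below 2048 so the byte count fits in an int.
    public static double MeasureCopyGBPerSecond(int megabytes, int passes)
    {
        byte[] source = new byte[megabytes * 1024 * 1024];
        byte[] destination = new byte[source.Length];

        // Touch every page up front so page faults are not part of the timing.
        for (int i = 0; i < source.Length; i += 4096)
        {
            source[i] = 1;
            destination[i] = 1;
        }

        Stopwatch sw = Stopwatch.StartNew();
        for (int pass = 0; pass < passes; pass++)
        {
            Buffer.BlockCopy(source, 0, destination, 0, source.Length);
        }
        sw.Stop();

        double bytesMoved = 2.0 * source.Length * passes;   // each pass reads and writes the buffer
        return bytesMoved / sw.Elapsed.TotalSeconds / 1e9;
    }

    public static void Main()
    {
        Console.WriteLine("{0:F2} GB/s", MeasureCopyGBPerSecond(256, 10));
    }
}
```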
You can test random access by randomly indexing into a big array. The index calculation must be cheap, unpredictable and there must be a dependency chain that serializes the memory instructions so that the CPU cannot parallelize them.
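One common way to satisfy those three requirements is pointer chasing: fill an array with a random permutation that forms a single cycle, then repeatedly load `index = next[index]`. Each load depends on the previous one, so the CPU cannot overlap them. The sketch below uses Sattolo's shuffle to build the cycle; `PointerChase`, `BuildSingleCycle` and `Chase` are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class PointerChase
{
    // Builds a permutation that is one big cycle (Sattolo's algorithm),
    // so following next[] from any start visits every slot before repeating.
    public static int[] BuildSingleCycle(int length, int seed)
    {
        int[] next = new int[length];
        for (int i = 0; i < length; i++) next[i] = i;

        Random rng = new Random(seed);
        for (int i = length - 1; i > 0; i--)
        {
            int j = rng.Next(i);            // 0 <= j < i, never i itself
            int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }

    // Follows the chain for the given number of steps; every load depends on the previous one.
    public static int Chase(int[] next, long steps)
    {
        int index = 0;
        for (long s = 0; s < steps; s++)
        {
            index = next[index];
        }
        return index;
    }

    public static void Main()
    {
        int[] next = BuildSingleCycle(64 * 1024 * 1024, 12345);   // 256 MB of ints
        const long steps = 20000000;

        Stopwatch sw = Stopwatch.StartNew();
        int end = Chase(next, steps);
        sw.Stop();

        // Printing the final index keeps the loop from being optimized away.
        Console.WriteLine("{0:F1} ns per access (ended at {1})",
            sw.Elapsed.TotalMilliseconds * 1e6 / steps, end);
    }
}
```

The figure includes the array bounds check and TLB misses, so it is an upper estimate of raw memory latency rather than an exact one.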
Honestly, I don't think it's possible. RAM benchmarks usually run off dedicated OSes.
RAM testing is different from RAM benchmarking.
C# doesn't give you that kind of control over RAM.
Of course, just new up a big array and access it. Also, understand the overheads that are present. The only overhead is a range check.
The GC has no impact during the benchmark. It might be triggered by an allocation.

CLR / High memory consumption after switching from 32-bit process to 64-bit process

I have a backend application (windows service) built on top of .NET Framework 4.5 (C#). The application runs on Windows Server 2008 R2 server, with 64GB of memory.
Due to dependencies I had, I used to compile and run this application as a 32-bit process (compile it as x86) and use /LARGEADDRESSAWARE flag to let the application use more than 2GB memory in the user space. Using this configuration, the average memory consumption (according to the "memory (private working set)" column in the task manager) was about 300-400MB.
The reason I needed the LARGEADDRESSAWARE flag, and the reason I changed it to 64-bit, is that although 300-400MB is the average, once in a while this app does work that involves loading a lot of data into memory (and it's much easier to develop and manage this kind of thing when you're not tightly limited memory-wise).
Recently (after removing those x86 native dependencies), I changed the application compilation to "Any CPU", so now, on the production server, it runs as a 64-bit process. Since I made this change, the average memory consumption (according to the task manager) has reached new levels: 3-4 GB, with no other change that might explain this behavior.
Here are some additional facts about the current state:
According to the "#Bytes in all heaps" counter, the total amount of memory is about 600MB.
When debugging the process with WinDbg+SOS, !dumpheap -stat showed about 250-300MB of free blocks, but all the other objects together added up to much less than the total amount of memory the process used.
According to the GC performance counters, there are Gen0 collections on a regular basis. In fact, the "% Time in GC" counter indicates that on average 10-20% of the time is spent in GC (which makes sense given the nature of the application: a lot of allocations of information and data structures that are in use for a short time).
I'm using Server GC in this app.
There is no memory problem on the server. It uses about 50-60% of the available memory (64GB).
My questions:
Why is there such a great difference between the memory allocated to the process (according to the task manager) and the actual size of the CLR heap (there is no unmanaged code in the process that can explain this)?
Why does the 64-bit process take so much more memory than the same process running as 32-bit? Even considering that pointers take twice the size, that's a big difference.
Can I do something to lower the memory consumption, or to get a better understanding of the issue?
Thanks!
There are a few things to consider:
1) You mentioned you're using Server GC mode. In server GC mode, the CLR creates one heap for every CPU core on the machine, which is more efficient for multi-threaded processing in server processes, e.g. ASP.NET processes. Each heap has two segments: one for small objects, one for large objects. Each segment starts with 4 GB of reserved memory. Basically, server GC mode trades higher memory use for better overall system performance.
2) Pointers are bigger on 64-bit, of course.
3) A foreground Gen2 GC becomes super expensive in server GC mode because the heap is much larger. So the CLR tries very hard to reduce the number of foreground Gen2 GCs, sometimes using background Gen2 GCs instead.
4) Depending on usage, fragmentation can become a real issue. I've seen heaps with 98% fragmentation (98% heap is free blocks).
To really solve your problem, you need to get an ETW trace + a memory dump, and then use tools like PerfView for detailed analysis.
A 64-bit process will naturally use 64-bit pointers, effectively doubling the memory usage of every reference. Certain platform-dependent variables such as IntPtr will also take up double the space.
The first and best thing you can do is to run a memory profiler to see where exactly the extra memory footprint is coming from. Anything else is speculative!
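Before reaching for a full profiler, a cheap first step is to have the process log its own view of the numbers discussed above: GC mode, core count, managed heap size, and the private bytes the OS reports. This is a minimal sketch; `MemorySnapshot` and `Capture` are hypothetical names, and it does not replace a dump or an ETW trace.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

public static class MemorySnapshot
{
    // Returns a one-line summary comparing the managed heap with what the OS has committed.
    public static string Capture()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            long managed = GC.GetTotalMemory(false);   // approximate, no forced collection
            return string.Format(
                "server GC: {0}, cores: {1}, managed heap: {2:N0} bytes, " +
                "private bytes: {3:N0}, working set: {4:N0}",
                GCSettings.IsServerGC,
                Environment.ProcessorCount,
                managed,
                current.PrivateMemorySize64,
                current.WorkingSet64);
        }
    }

    public static void Main()
    {
        Console.WriteLine(Capture());
    }
}
```

Logging this periodically shows whether the gap between the managed heap and private bytes is stable or growing, which helps decide what to look for in the profiler.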

How to minimize memory usage in windows

I wrote a hello world program (windows application without any UI) in C#. The release-build executable does nothing but call Thread.Sleep(50000) //50 seconds.
I opened sysinternals (a profiler like task manager). This executable ate 7MB of memory (private bytes)!!
Can anybody explain what is happening and how to make the memory usage smaller?
P.S. I also tried to use NGEN to pre-compile the .exe but still got the same memory usage.
Thanks a lot
C# (and other JIT compiled/interpreted languages) usually end up doing a lot of things for you automatically in the background. While what you've written is simple, there's a lot of stuff going on in the background to support the JIT nature of the application.
That 7MB of memory is relatively small given 2GB of RAM is fairly commonplace these days. And it probably won't go up more unless you do something unusual or allocate lots of arrays and data structures.
If it's a Hello World based on the C# WindowsApplication project type, and there's a Main in Program.cs calling Application.Run on a Windows Form, then not only is there a lot of JIT overhead, but there's a lot of Windows Forms overhead too.
End of the day, I'm sure everything is dandy.
.Net apps have a lot of basic overhead, but your memory increase should be relatively small after that point.
Example. My one app consumes 10MB of memory on load, but it only consumes 40MB of memory after loading 60k rows from a Database, each row containing multiple strings and many values.
A small fixed upfront value isn't that bad, on modern computers. Scalability is the primary issue now-a-days. .Net scales fairly well for being managed.

How big an effect on compile times does L2 cache size have?

I am in the middle of the decision process for a new developer workstation, and one remaining question is which processor to choose, and one of the early decisions is whether to go with Xeon or Core2 processors. (We've already restricted ourselves to HP machines, so we're only looking at Intel processors.)
The main goal of the upgrade is to shorten compile times as much as we can. We're using Visual Studio 2008 targeting .NET 3.5, mostly working on a solution with about a dozen projects. We know our build is CPU-bound. Since Visual Studio can't parallelize C# builds, we know we want to maximize CPU clock frequency - but the question is, do the larger caches of the Xeon line help during compilation, and if they do is the increase justifiable given the tripling in price?
You can add a custom task to VS2008 to make it build in parallel, so the more (virtual) processors you have, the better. Take a look here. It helped me greatly.
I would guess that the compile process is more I/O-bound than CPU-bound. At least I could cut my compile time in half by putting my ASP.NET application on a RAM drive. (See here). As such, I would suggest not only thinking about the CPU but also about your disks, perhaps even more so.
I would really recommend that you measure this yourself. You're going to have loads of factors affecting performance e.g. are you compiling lots of small components, or one big deliverable (i.e. how CPU-bound will this be) ? And what disks are you specifying ? Memory ? All of this will make a difference, and it would be worth it to borrow some sample machines and test out your scenarios.
As for the question about cache size performance being 'worth it' - again - how much are you prepared to spend on compilation servers and how much is your time worth ? I suspect that if the servers are compiling more than a few hours a day and you have more than a couple of developers, the extra horsepower will more than pay for itself.
If I was you I would just go for the Q9550 with 12MB L2 cache :) They are currently good value for money.
I 'unfortunately' had to get a Core i7 860 due to my previous motherboard not supporting the FSB of the quadcore. I have no complaints though :)

Multiprocessor and Performance

I'm facing a really strange problem with a .Net service.
I developed a multithreaded x64 windows service.
I tested this service in a x64 server with 8 cores. The performance was great!
Now I have moved the service to a production server (x64 - 32 cores). During the tests I found out the performance is at least 10 times worse than on the test server.
I've checked loads of performance counters trying to find some reason for this poor performance, but I couldn't find anything conclusive.
Could it be a GC problem? Have you ever faced a problem like this?
Thank you in advance!
Alexandre
This is a common problem which people are generally unaware of, because very few people have experience on many-CPU machines.
The basic problem is contention.
As the CPU count increases, contention increases in all shared data structures. For low CPU counts, contention is low and the fact you have multiple CPUs improves performance. As the CPU count becomes significantly larger, contention begins to drown out your performance improvements; as the CPU count becomes large, contention actually starts reducing performance below that of a lower number of CPUs.
You are basically facing one of the aspects of the scalability problem.
I'm not sure, however, where this problem lies: in your data structures, or in the operating system's data structures. The former you can address - lock-free data structures are an excellent, highly scalable approach. The latter is difficult, since it essentially requires avoiding certain OS functionality.
There are way too many variables to know why one machine is slower than the other. 32 core machines are usually more specialized, whereas an eight core could just be a dual proc quad core machine. Are there VMs or other things running at the same time? Usually with that many cores, IO bandwidth becomes the limiting factor (even if the CPUs still have plenty of bandwidth).
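To see the effect of contention in application code, compare a counter where every increment takes one shared lock with a version where each worker accumulates privately and publishes once with an atomic add. This sketch shows reduced sharing plus `Interlocked`, which is a simpler relative of full lock-free data structures, not a replacement for them; `ContentionDemo` and its method names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ContentionDemo
{
    // Every increment takes the same lock, so all workers contend on one object.
    public static long CountWithSharedLock(int workers, int perWorker)
    {
        object gate = new object();
        long total = 0;
        Parallel.For(0, workers, w =>
        {
            for (int i = 0; i < perWorker; i++)
            {
                lock (gate) { total++; }
            }
        });
        return total;
    }

    // Each worker counts in a private local and touches shared state only once.
    public static long CountWithLocalTotals(int workers, int perWorker)
    {
        long total = 0;
        Parallel.For(0, workers, w =>
        {
            long local = 0;
            for (int i = 0; i < perWorker; i++)
            {
                local++;
            }
            Interlocked.Add(ref total, local);   // single atomic publish per worker
        });
        return total;
    }

    public static void Main()
    {
        int workers = Environment.ProcessorCount;

        Stopwatch sw = Stopwatch.StartNew();
        CountWithSharedLock(workers, 5000000);
        Console.WriteLine("shared lock:  " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        CountWithLocalTotals(workers, 5000000);
        Console.WriteLine("local totals: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

The private loop is trivial enough that the JIT may simplify it, so treat the second timing as a lower bound; the point is how the shared-lock timing grows as the core count rises.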
To start off, you should probably add lots of timers in your code (or profiling or whatever) to figure out what part of your code is taking up the most time.
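A timer in this sense can be as small as a helper that wraps a section of code in a `Stopwatch`. The sketch below is one possible shape; `SectionTimer` and `Time` are hypothetical names, and writing to the console is only a placeholder for whatever logging the service already uses.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class SectionTimer
{
    // Runs the given work, prints how long it took, and returns the elapsed time.
    public static TimeSpan Time(string label, Action work)
    {
        Stopwatch sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        Console.WriteLine("{0}: {1:F1} ms", label, sw.Elapsed.TotalMilliseconds);
        return sw.Elapsed;
    }

    public static void Main()
    {
        // Thread.Sleep stands in for a real stage of the service.
        Time("load", () => Thread.Sleep(50));
        Time("process", () => Thread.Sleep(120));
    }
}
```

Comparing the per-section timings from the 8-core and the 32-core machine narrows down which stage degrades.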
Performance troubleshooting 101: what is the bottleneck (where in the code, and in which subsystem: memory, disk, or CPU)?
There are so many factors here:
are you actually using the cores?
are your extra threads causing locking issues to be more obvious?
do you not have enough memory to support all the extra stacks / data you can process?
can your IO (disk/network/database) stack keep up with the throughput?
etc
Could it be down to differences in memory or the disk? If that were the bottleneck, you'd not get the value of the additional processing power. Can't really tell without more details of your application/configuration.
With that many threads running concurrently, you're going to have to be really careful to get around issues of threads fighting with each other to access your data. Read up on non-blocking synchronization.
How many threads are you using? Using too many thread pool threads could cause thread starvation, which would make your program slower.
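The thread pool exposes its own limits and current usage, which gives a quick indication of whether the pool is close to exhaustion. This is a minimal sketch using `ThreadPool.GetMaxThreads`, `GetAvailableThreads` and `GetMinThreads`; `PoolReport` and its method names are hypothetical.

```csharp
using System;
using System.Threading;

public static class PoolReport
{
    // Number of worker threads currently in use: configured maximum minus available.
    public static int BusyWorkers()
    {
        int maxWorker, maxIo, freeWorker, freeIo;
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);
        return maxWorker - freeWorker;
    }

    // One-line summary of the pool limits and current worker usage.
    public static string Describe()
    {
        int maxWorker, maxIo, minWorker, minIo;
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        return string.Format("workers busy: {0}, min: {1}, max: {2}, cores: {3}",
            BusyWorkers(), minWorker, maxWorker, Environment.ProcessorCount);
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Sampling this while the service is under load shows whether the number of busy workers sits far above the core count.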
Some articles:
http://www2.sys-con.com/ITSG/virtualcd/Dotnet/archives/0112/gomez/index.html
http://codesith.blogspot.com/2007/03/thread-starvation-in-shared-thread-pool.html
(search for thread starvation in them)
You could use a .NET profiler to find your bottlenecks; here is a good free one:
http://www.eqatec.com/tools/profiler
I agree with Blank, it's likely to be some form of contention. It's likely to be very hard to track down, unfortunately. It could be in your application code, the framework, the OS, or some combination thereof. Your application code is the most likely culprit, since Microsoft has expended significant effort on making the CLR and the OS scale on 32P boxes.
The contention could be in some hot locks, but it could be that some processor cache lines are sloshing back and forth between CPUs.
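Cache line sloshing (false sharing) can be reproduced without any locks at all: give each thread its own counter, but place the counters next to each other in memory. The sketch below compares adjacent counters with counters spaced 16 longs (128 bytes) apart; the 64-byte cache line size behind that spacing is an assumption, and `FalseSharingDemo` and `Run` are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class FalseSharingDemo
{
    // Each thread increments its own slot; stride controls how far apart the slots sit.
    // stride = 1 packs the counters into the same cache line, stride = 16 separates them.
    public static long[] Run(int threads, int stride, int iterations)
    {
        long[] slots = new long[threads * stride];
        Thread[] workers = new Thread[threads];

        for (int t = 0; t < threads; t++)
        {
            int slot = t * stride;          // private copy for the closure
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    slots[slot]++;
                }
            });
        }

        foreach (Thread worker in workers) worker.Start();
        foreach (Thread worker in workers) worker.Join();
        return slots;
    }

    public static void Main()
    {
        int threads = Math.Min(4, Environment.ProcessorCount);

        Stopwatch sw = Stopwatch.StartNew();
        Run(threads, 1, 50000000);
        Console.WriteLine("adjacent counters: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        Run(threads, 16, 50000000);
        Console.WriteLine("padded counters:   " + sw.ElapsedMilliseconds + " ms");
    }
}
```

No thread ever reads another thread's counter, so any slowdown in the adjacent case comes from the cache line moving between cores, not from a logical dependency.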
What's your metric for 10x worse? Throughput?
Have you tried booting the 32-proc box with fewer CPUs? Use the /NUMPROC option in boot.ini or BCDedit.
Do you achieve 100% CPU utilization? What's your context switch rate like? And how does this compare to the 8P box?
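An in-process approximation of the "fewer CPUs" experiment is to cap the degree of parallelism and watch how elapsed time changes as the cap rises. This is not equivalent to booting with /NUMPROC, since the OS still schedules across all cores, but it can reveal whether adding workers stops helping. `ScalingProbe` and `TimeAt` are hypothetical names, and the square-root loop is a placeholder for the service's real work item.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ScalingProbe
{
    // Runs `items` work items with at most `degree` concurrent workers and returns the elapsed time.
    public static TimeSpan TimeAt(int degree, int items, Action<int> work)
    {
        ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        Stopwatch sw = Stopwatch.StartNew();
        Parallel.For(0, items, options, work);
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        const int items = 2000;
        double[] results = new double[items];

        // Placeholder CPU-bound work; storing the result keeps it from being optimized away.
        Action<int> work = i =>
        {
            double x = 0;
            for (int k = 0; k < 200000; k++) x += Math.Sqrt(k);
            results[i] = x;
        };

        for (int degree = 1; degree <= Environment.ProcessorCount; degree *= 2)
        {
            TimeSpan elapsed = TimeAt(degree, items, work);
            Console.WriteLine("degree {0,2}: {1:F0} ms", degree, elapsed.TotalMilliseconds);
        }
    }
}
```

If elapsed time stops dropping, or starts rising, well before the degree reaches the core count, that points to contention rather than a lack of CPU.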
