I'm developing a benchmarking application using Universal Windows Platform that evaluates CPU and RAM performance of a Windows 10 system.
Although I found various algorithms to benchmark a CPU, I still haven't found any solid algorithm or solution to evaluate the read and write speeds of memory.
How can I achieve this in C#?
Thanks in advance :)
I don't see why this would not be possible from managed code. Array access code turns into normal x86 memory instructions. It's a thin abstraction. In particular I don't see why you would need a customized OS.
You should be able to test sequential memory speed by performing memcpy on big arrays. They must be bigger than the last level cache size.
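For instance, a minimal sketch of such a test (the 256MB buffer size and iteration count here are arbitrary choices, not measured values):

```csharp
// Sketch: sequential bandwidth via large copies. Buffers must dwarf the LLC.
using System;
using System.Diagnostics;

class SequentialBandwidth
{
    static void Main()
    {
        const int bytes = 256 * 1024 * 1024;          // far larger than any cache
        byte[] src = new byte[bytes];
        byte[] dst = new byte[bytes];
        Buffer.BlockCopy(src, 0, dst, 0, bytes);      // warm-up: page the memory in

        const int iterations = 10;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Buffer.BlockCopy(src, 0, dst, 0, bytes);
        sw.Stop();

        // Each copy reads the source and writes the destination once.
        double gb = 2.0 * bytes * iterations / (1024.0 * 1024.0 * 1024.0);
        Console.WriteLine("Sequential: {0:F2} GB/s", gb / sw.Elapsed.TotalSeconds);
    }
}
```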
You can test random access by randomly indexing into a big array. The index calculation must be cheap, unpredictable and there must be a dependency chain that serializes the memory instructions so that the CPU cannot parallelize them.
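A sketch of that approach: building a single random cycle through the array (Sattolo's shuffle) guarantees each load's address depends on the previous load's result, so the CPU cannot overlap the accesses. Sizes are again arbitrary.

```csharp
// Sketch: dependent random access (pointer chasing through an index array).
using System;
using System.Diagnostics;

class RandomAccessLatency
{
    static void Main()
    {
        const int count = 64 * 1024 * 1024;           // 256MB of int indices
        int[] next = new int[count];
        for (int i = 0; i < count; i++) next[i] = i;

        // Sattolo's algorithm: produces one random cycle covering every slot.
        Random rng = new Random(12345);
        for (int i = count - 1; i > 0; i--)
        {
            int j = rng.Next(i);                      // j in [0, i)
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        Stopwatch sw = Stopwatch.StartNew();
        int p = 0;
        for (int i = 0; i < count; i++)
            p = next[p];                              // serialized, unpredictable loads
        sw.Stop();

        Console.WriteLine("~{0:F1} ns per access (p={1})",
            sw.Elapsed.TotalMilliseconds * 1e6 / count, p);  // print p to keep it live
    }
}
```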
Honestly, I don't think it's possible. RAM benchmarks usually run off dedicated OSes.
RAM testing is different from RAM benchmarking.
C# doesn't give you that kind of control over RAM
Of course: just new up a big array and access it. Also, understand the overheads that are present; for plain array indexing, the only one is a range check.
The GC has no impact during the benchmark itself; it can only be triggered by an allocation, so allocate everything up front.
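Putting those two points together, a minimal sketch: allocate before timing, collect once, and use the i < array.Length loop shape, which lets the JIT prove the index is in range and drop the check:

```csharp
// Sketch: keep the GC quiet during the timed region and let the JIT
// eliminate the range check.
using System;

class ArraySweep
{
    static void Main()
    {
        int[] data = new int[100 * 1000 * 1000];   // allocate everything up front
        GC.Collect();                              // settle the heap before timing
        GC.WaitForPendingFinalizers();
        GC.Collect();

        long sum = 0;
        // Comparing i against data.Length lets the JIT hoist the bounds check.
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        Console.WriteLine(sum);                    // keep the loop from being elided
    }
}
```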
I'm writing a high-ish volume web service in C# running in 64-bit IIS on Win 2k8 (.NET 4.5) that works with XML payloads and does a variety of operations on small and large objects (where the large objects are mainly strings, some over 85k (so going onto the LOH)). Requests are stateless, and memory usage remains steady over time. Lots of memory is being allocated and released per request, no memory appears to be being leaked.
Operating at a maximum of 25 transactions per second, with an average call lasting 5s, it's spending 40-60% of its time in GC according to two profiling tools, and perfmon shows a steady 20 G0 and G1 collections over 5 seconds, and 15 G2 collections over 5 seconds - meaning lots of (we think) premature promotion into G2 for data that we'd expect to stay in G0. Everything I read indicates this is very excessive. We expect that the system should be able to perform at a higher throughput than 25 tps and assume the GC activity is preventing this.
The machines serving the requests have lots of memory - 16GB - and the application consumes at most 1GB after an hour under load. I understand that a bigger heap won't necessarily make things better, but there is spare memory.
I appreciate this is light on specifics (will try to recreate the conditions with a trivial application if time permits) - but can anyone explain why we see so much G2 GC activity? Should I be focusing on the LOH? People keep telling me that the CLR's GC "adapts" to your load, but it's not changing its behavior in this case and, unlike other runtimes, there seems to be little I can do to tune it (I have tried workstation GC, but there is very little observable difference).
Microsoft decided to design the String class so that all strings are stored in memory as a monolithic sequence of characters. While this works well for some usage patterns, it works dreadfully for others.
One thing I've found very helpful is to avoid creating instances of String whenever possible. If a method will often be used to operate on part of a supplied string, and will in turn ask other methods to operate on parts of it, the methods should accept arguments specifying the range of the String upon which they should operate. This will avoid the need for callers of the first method to use Substring to construct a new String for the method to act upon, and will avoid the need to have the method call Substring to feed portions of the string to its callers. In some cases where I have used this technique, the creation of thousands of String instances--some quite large--could be replaced with zero.
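For illustration, a hypothetical method written in that style (the names here are invented, but the pattern is the one described):

```csharp
// Sketch: pass a (start, length) range instead of allocating a Substring.
using System;

static class StringRanges
{
    // Operates on s[start .. start+length) without creating a new String.
    public static int CountVowels(string s, int start, int length)
    {
        int n = 0;
        for (int i = start; i < start + length; i++)
            if ("aeiouAEIOU".IndexOf(s[i]) >= 0) n++;
        return n;
    }

    static void Main()
    {
        string text = "hello world";
        // Zero allocations, versus CountVowels(text.Substring(0, 5), 0, 5).
        Console.WriteLine(CountVowels(text, 0, 5));   // vowels in "hello"
    }
}
```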
CLR's GC "adapts" to your load
It can't know how much memory you are willing to tolerate as overhead. Here, you probably want to give the app something like 5GB of heap so that collections are much rarer. The GC has no built-in tuning knobs for that (subjective note: that's a pity).
You can force bigger heap sizes by using one of the low latency modes for short durations. That should cause the GC to try hard to avoid G2 collections. Monitor the RAM usage and disable low latency mode when consumption reaches 5GB.
This is a risky strategy but it's the best I think you can do.
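A rough sketch of that strategy, assuming .NET 4.5's SustainedLowLatency mode; the 5GB threshold and one-second polling interval are illustrative, not tuned values:

```csharp
// Sketch: hold the GC in a low latency mode, backing off when the heap grows.
using System;
using System.Runtime;
using System.Threading;

class LowLatencyGuard
{
    const long ThresholdBytes = 5L * 1024 * 1024 * 1024;   // back off near 5GB

    static void Main()
    {
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
        while (true)
        {
            Thread.Sleep(1000);
            // GC.GetTotalMemory(false) is a cheap, approximate heap size.
            if (GC.GetTotalMemory(false) > ThresholdBytes)
            {
                // Fall back to the default mode and let a full collection run.
                GCSettings.LatencyMode = GCLatencyMode.Interactive;
                GC.Collect();
                GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
            }
        }
    }
}
```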
I would not do it. At most you can gain 2x throughput. Your CPU is maxed out, right? Workstation GC does not scale to multiple cores and leaves CPUs unused.
I'm considering porting a processor intensive 32 bit .NET 4 (C#) console application to 64 bit. It is a parallelized number crunching algorithm that runs for about five hours.
I know a bit about the performance benefits but I'm not sure how to change the code to take full advantage of the 64-bit architecture. For example, should I consider changing all Int32 types to Int64, etc.?
Where could I find some comprehensive resources to learn about optimization considerations when porting to x64? Searching around has come up mostly with general information about the x64 architecture.
There's little hope that your hard work will pay off. An x64 core ticks at the same pace as an x86 core, clock speeds are the same and constrained by the physics of silicon. One clear advantage an x64 core has is that it can whip 64 bits around at the same time, double the amount of an x86 core. But in practice there are few real-world problems that actually have a need for that range. Most integer problems work just fine with a range of +/- two billion. It is easy to tell from your code: if you have a lot of long instead of int variables in your inner-most loops, then you'll be ahead.
A significant disadvantage of an x64 core is that it consumes the available cache a lot quicker. That too is constrained by silicon, there's only so much memory they can fit on a chip. Any pointer is double the size, 8 bytes instead of 4 bytes in 32-bit mode. Pointers are used for any object reference. Cache is a very big deal on modern cores, the memory bus is glacially slow compared to the speed of the core. This disadvantage is balanced somewhat by x64 having 8 more CPU registers (r8 through r15).
The net effect is that 32-bit code usually executes a bit faster than 64-bit code. You only get a benefit from 64-bit code when your program is bogged down by having to cram the data it processes in a 2 gigabyte address space. In other words, having to use files or memory-mapped files to avoid running out of memory. If your app is compute constrained, as suggested in your question, it is very unlikely you'll be ahead.
For pure .NET code it seems there is little you can optimize yourself.
Compile for Any CPU and the JIT compiler will optimize for x64, taking advantage of the larger address space and a couple of extra registers.
See "How can compiling my application for 64-bit make it faster or better?" for a similar discussion.
You did not mention whether your application is also memory intensive. However,
be aware that when making this transition, your application's memory footprint is expected to grow significantly, since all the reference-type pointers now take twice the space. In other words, although you have an effectively infinite address space, effective memory utilization is still limited by RAM.
Changing 32-bit types to 64-bit types in itself doesn't help: 32-bit operations are all as fast as or faster than 64-bit operations on x64. What's different is that 64-bit operations are faster than they are in 32-bit mode. So, for example, if you use an array of Int32s and you could instead use an array of Int64s that's half as long, it's likely to help.
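As a sketch of that idea (the array sizes are arbitrary; both passes sum the same ~400MB of zeroed data):

```csharp
// Sketch: the same volume of data processed 4 bytes at a time vs 8 bytes
// at a time. In a 64-bit process the long pass does half as many operations.
using System;
using System.Diagnostics;

class WideOps
{
    static void Main()
    {
        int[] a32 = new int[100 * 1000 * 1000];      // ~400MB of Int32s
        long[] a64 = new long[50 * 1000 * 1000];     // same ~400MB, half the elements

        Stopwatch sw = Stopwatch.StartNew();
        long sum = 0;
        for (int i = 0; i < a32.Length; i++) sum += a32[i];
        Console.WriteLine("Int32 pass: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < a64.Length; i++) sum += a64[i];
        Console.WriteLine("Int64 pass: {0} ms (sum={1})", sw.ElapsedMilliseconds, sum);
    }
}
```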
I wrote a hello world program (a Windows application without any UI) in C#. The release-build executable doesn't do anything but Thread.Sleep(50000) //50 seconds.
I opened a Sysinternals profiler (similar to Task Manager). This executable ate 7MB of memory (private bytes)!
Can anybody explain what is happening and how to make the memory usage smaller?
P.S. I also tried to use NGEN to pre-compile the .exe but still got the same memory usage.
Thanks a lot
C# (and other JIT compiled/interpreted languages) usually end up doing a lot of things for you automatically in the background. While what you've written is simple, there's a lot of stuff going on in the background to support the JIT nature of the application.
That 7MB of memory is relatively small given 2GB of RAM is fairly commonplace these days. And it probably won't go up more unless you do something unusual or allocate lots of arrays and data structures.
If it's a Hello World based on the C# WindowsApplication project type, and there's a Main method in Program.cs calling Application.Run on a Windows Form, then not only is there a lot of JIT overhead, but there's a lot of Windows Forms overhead too.
End of the day, I'm sure everything is dandy.
.Net apps have a lot of basic overhead, but your memory increase should be relatively small after that point.
For example, one of my apps consumes 10MB of memory on load, but only 40MB after loading 60k rows from a database, each row containing multiple strings and many values.
A small fixed upfront cost isn't that bad on modern computers. Scalability is the primary issue nowadays. .Net scales fairly well for being managed.
How do you calculate memory (RAM) bandwidth used? Which performance counters are required?
I came across a tool that was able to do it, the "Rightmark multi-threaded memory test". But unlike the rest of Rightmark's tests, I haven't found the source code for it, just the binaries.
If your code can run on Linux, use Cachegrind:
Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20--100x slower than normal.
You may want to use the KCacheGrind GUI.
It is very difficult to 'calculate' memory bandwidth usage. There are lots of non-trivial cache and MMU issues to contend with. The only real way to do it is either through the use of simulation or real-world measurements.
You can get a 'rough' idea by debugging the code and counting the number of memory load and store operations performed. However, knowing whether it was a cache hit/miss is another issue.
It depends on your purpose. If it is to obtain a guesstimate, you can use the rule of thumb that about 30% of general purpose code is memory loads and stores. If you're trying to get a worst case, you can assume that caches miss all the time and work it out.
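For example, a back-of-envelope guesstimate built on that rule of thumb; the instruction rate and access size below are assumptions you'd replace with your own numbers:

```csharp
// Sketch: guesstimate memory traffic from instruction throughput.
using System;

class BandwidthGuess
{
    static void Main()
    {
        double instrPerSec = 2e9;       // assumed: ~2 billion instructions/s
        double memOpFraction = 0.30;    // rule of thumb: ~30% are loads/stores
        double bytesPerAccess = 8;      // assumed: 64-bit accesses

        double bytesPerSec = instrPerSec * memOpFraction * bytesPerAccess;
        Console.WriteLine("~{0:F1} GB/s of memory traffic", bytesPerSec / 1e9);
        // Worst case (every access misses): each access drags in a whole
        // 64-byte cache line, so multiply by 64 / bytesPerAccess.
    }
}
```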
One potential thing you could do is to look at virtualisation. There are several open source options (QEMU comes to mind). It may be possible to export certain hardware measurements from them.
I'm facing a really strange problem with a .Net service.
I developed a multithreaded x64 Windows service.
I tested this service on an x64 server with 8 cores. The performance was great!
Now I've moved the service to a production server (x64, 32 cores). During the tests I found the performance is at least 10 times worse than on the test server.
I've checked loads of performance counters trying to find some reason for this poor performance, but I couldn't pinpoint the cause.
Could it be a GC problem? Have you ever faced a problem like this?
Thank you in advance!
Alexandre
This is a common problem which people are generally unaware of, because very few people have experience on many-CPU machines.
The basic problem is contention.
As the CPU count increases, contention increases in all shared data structures. For low CPU counts, contention is low and the fact you have multiple CPUs improves performance. As the CPU count becomes significantly larger, contention begins to drown out your performance improvements; as the CPU count becomes large, contention actually starts reducing performance below that of a lower number of CPUs.
You are basically facing one of the aspects of the scalability problem.
I'm not sure however where this problem lies; in your data structures, or in the operating system's data structures. The former you can address - lock-free data structures are an excellent, highly scalable approach. The latter is difficult, since it essentially requires avoiding certain OS functionality.
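As an illustration of the lock-free approach, a minimal Treiber stack whose operations retry with Interlocked.CompareExchange instead of taking a lock (a sketch, not production code):

```csharp
// Sketch: a lock-free stack. Threads never block; on conflict they retry.
using System.Threading;

class LockFreeStack<T>
{
    class Node { public T Value; public Node Next; }

    Node _head;

    public void Push(T value)
    {
        Node node = new Node { Value = value };
        while (true)
        {
            Node oldHead = _head;
            node.Next = oldHead;
            // Install the new head only if nobody changed it in the meantime.
            if (Interlocked.CompareExchange(ref _head, node, oldHead) == oldHead)
                return;
        }
    }

    public bool TryPop(out T value)
    {
        while (true)
        {
            Node oldHead = _head;
            if (oldHead == null) { value = default(T); return false; }
            if (Interlocked.CompareExchange(ref _head, oldHead.Next, oldHead) == oldHead)
            {
                value = oldHead.Value;
                return true;
            }
        }
    }
}
```

In practice you rarely need to write this yourself: the System.Collections.Concurrent types that shipped with .NET 4 (ConcurrentStack, ConcurrentQueue, ConcurrentDictionary) are well-tested implementations of the same idea.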
There are way too many variables to know why one machine is slower than the other. 32-core machines are usually more specialized, whereas an eight-core could just be a dual-proc quad-core machine. Are there VMs or other things running at the same time? Usually with that many cores, IO bandwidth becomes the limiting factor (even if the CPUs still have plenty of capacity).
To start off, you should probably add lots of timers in your code (or profiling or whatever) to figure out what part of your code is taking up the most time.
Performance troubleshooting 101: what is the bottleneck (where in the code, and which subsystem: memory, disk, CPU)?
There are so many factors here:
are you actually using the cores?
are your extra threads causing locking issues to be more obvious?
do you not have enough memory to support all the extra stacks / data you can process?
can your IO (disk/network/database) stack keep up with the throughput?
etc
Could it be down to differences in memory or the disk? If those were the bottleneck, you wouldn't get the benefit of the additional processing power. Can't really tell without more details of your application/configuration.
With that many threads running concurrently, you're going to have to be really careful to get around issues of threads fighting with each other to access your data. Read up on Non-blocking synchronization.
How many threads are you using? Using too many thread-pool threads could cause thread starvation, which would make your program slower.
Some articles:
http://www2.sys-con.com/ITSG/virtualcd/Dotnet/archives/0112/gomez/index.html
http://codesith.blogspot.com/2007/03/thread-starvation-in-shared-thread-pool.html
(search for thread starvation in them)
You could use a .NET profiler to find your bottlenecks; here's a good free one:
http://www.eqatec.com/tools/profiler
I agree with Blank, it's likely to be some form of contention. It's likely to be very hard to track down, unfortunately. It could be in your application code, the framework, the OS, or some combination thereof. Your application code is the most likely culprit, since Microsoft has expended significant effort on making the CLR and the OS scale on 32P boxes.
The contention could be in some hot locks, but it could be that some processor cache lines are sloshing back and forth between CPUs.
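That cache-line effect (false sharing) is easy to demonstrate; a sketch, assuming the usual 64-byte lines on x86:

```csharp
// Sketch: two threads increment counters in the same cache line, then in
// separate lines. The padded version avoids the line ping-ponging between CPUs.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharing
{
    const int Iterations = 100 * 1000 * 1000;

    static long Run(long[] counters, int indexA, int indexB)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Parallel.Invoke(
            () => { for (int i = 0; i < Iterations; i++) counters[indexA]++; },
            () => { for (int i = 0; i < Iterations; i++) counters[indexB]++; });
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        long[] counters = new long[64];
        // Adjacent longs share a 64-byte cache line.
        Console.WriteLine("shared line: {0} ms", Run(counters, 0, 1));
        // 16 longs (128 bytes) apart: each counter gets its own line.
        Console.WriteLine("padded:      {0} ms", Run(counters, 0, 16));
    }
}
```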
What's your metric for 10x worse? Throughput?
Have you tried booting the 32-proc box with fewer CPUs? Use the /NUMPROC option in boot.ini or BCDedit.
Do you achieve 100% CPU utilization? What's your context switch rate like? And how does this compare to the 8P box?