Determining cache sector size for a processor - c#

I'm trying to build tests around processor cache-line optimizations relative to parallel processing. Specifically, I'm testing how parts of my products are being impacted by false-sharing inefficiencies. To do this, I need to be able to determine my processor's cache sector size (e.g. 64 bytes) so I can contrive tests with the appropriate object size ranges. So... how or where can I get this information (e.g. processor spec page, C# API call, etc.)? Cache sector size is also known as cache line size.
Note: I looked on the Intel site for my i7 processor spec and couldn't find these details, or maybe I just didn't recognize them.

I have done a similar experiment. I use CPU-Z and find it extremely helpful; it gives detailed information about CPU cores, caches (L1, L2, etc.), and more.
My suggestion: don't be distracted too much by hardware specs; focus on benchmarking, because your experiment is going to take a lot of time.
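If you want to read it programmatically rather than off a spec sheet or CPU-Z, here is a minimal sketch using WMI, assuming Windows and a reference to System.Management (the System.Management NuGet package on modern .NET). Win32_CacheMemory exposes a LineSize property, but depending on what the BIOS reports it can come back empty; the kernel32 GetLogicalProcessorInformation API (whose CACHE_DESCRIPTOR carries a LineSize field) is the usual fallback.

    using System;
    using System.Management;

    class CacheLineInfo
    {
        static void Main()
        {
            var searcher = new ManagementObjectSearcher(
                "SELECT Level, LineSize, MaxCacheSize FROM Win32_CacheMemory");

            foreach (ManagementObject cache in searcher.Get())
            {
                // Level: 3 = L1, 4 = L2, 5 = L3. LineSize is in bytes,
                // MaxCacheSize is in kilobytes. Either value may be null
                // if the firmware does not report it.
                Console.WriteLine("Level {0}: line size = {1} bytes, size = {2} KB",
                    cache["Level"], cache["LineSize"], cache["MaxCacheSize"]);
            }
        }
    }

For current Intel and AMD desktop processors, including i7 parts, the expected answer is 64 bytes.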

Related

Benchmarking RAM performance - UWP and C#

I'm developing a benchmarking application using Universal Windows Platform that evaluates CPU and RAM performance of a Windows 10 system.
Although I found various algorithms to benchmark a CPU, I still haven't found any solid algorithm or solution to evaluate the read and write speeds of memory.
How can I achieve this in C#?
Thanks in advance :)
I don't see why this would not be possible from managed code. Array access code turns into normal x86 memory instructions. It's a thin abstraction. In particular I don't see why you would need a customized OS.
You should be able to test sequential memory speed by performing memcpy on big arrays. They must be bigger than the last level cache size.
You can test random access by randomly indexing into a big array. The index calculation must be cheap, unpredictable and there must be a dependency chain that serializes the memory instructions so that the CPU cannot parallelize them.
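A minimal sketch of both tests, assuming the sizes below comfortably exceed the last-level cache of the machine under test; the array sizes and the use of Array.Copy are illustrative choices, not tuned values:

    using System;
    using System.Diagnostics;

    class MemoryBench
    {
        static void Main()
        {
            // Sequential bandwidth: copy arrays much larger than the last-level cache.
            const int bytes = 256 * 1024 * 1024;              // 256 MB (assumed > LLC)
            var src = new byte[bytes];
            var dst = new byte[bytes];
            Array.Copy(src, dst, bytes);                      // warm-up: commit the pages

            var sw = Stopwatch.StartNew();
            Array.Copy(src, dst, bytes);
            sw.Stop();
            // One copy reads src and writes dst, so it moves 2 * bytes of data.
            Console.WriteLine($"Sequential: {2.0 * bytes / sw.Elapsed.TotalSeconds / 1e9:F2} GB/s");

            // Random access: walk a single-cycle permutation so every load depends
            // on the previous one and the CPU cannot overlap the cache misses.
            const int n = 32 * 1024 * 1024;                   // 32M ints = 128 MB
            var next = new int[n];
            for (int i = 0; i < n; i++) next[i] = i;
            var rng = new Random(42);
            for (int i = n - 1; i > 0; i--)                   // Sattolo's shuffle: one big cycle
            {
                int j = rng.Next(i);                          // j < i
                (next[i], next[j]) = (next[j], next[i]);
            }

            sw.Restart();
            int idx = 0;
            for (int i = 0; i < n; i++) idx = next[idx];      // serial dependency chain
            sw.Stop();
            Console.WriteLine($"Random: {sw.Elapsed.TotalMilliseconds * 1e6 / n:F1} ns per access (idx={idx})");
        }
    }

The pointer-chasing loop uses a single-cycle permutation (Sattolo's shuffle) so the walk touches the whole array, and printing idx keeps the JIT from eliminating the loop as dead code.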
Honestly, I don't think it's possible. RAM benchmarks usually run off dedicated OSes.
RAM testing is different from RAM benchmarking.
C# doesn't give you that kind of control over RAM
Of course it's possible; just new up a big array and access it. Also, understand the overheads that are present: the only significant one is the array bounds check.
The GC has no impact during the benchmark itself; it only runs when triggered by an allocation, so don't allocate inside the timed loop.

Comparing available processing power of two machines

Think of a load balancer which is to balance the load according to the available (remaining) processing power of its units. How would you calculate this parameter to compare?
I'm trying to implement this in C#, and so far I can query the CPU usage as a percentage, but that isn't enough on its own, since different machines might be using different processors. Perhaps if I could find out the processing power of each machine and multiply it by its free CPU percentage, it would be a good estimate.
But what are the important parameters of a processor to include and how to aggregate them into one single number?
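As a rough sketch of the idea in the question (free CPU percentage times a capacity figure), assuming Windows performance counters are available (System.Diagnostics on .NET Framework, the System.Diagnostics.PerformanceCounter package on modern .NET). The capacity proxy used here is just the logical core count, so you would still need to scale it by some per-machine benchmark score to compare different processors fairly.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class AvailableCapacity
    {
        static void Main()
        {
            // Total CPU usage across all cores. The first NextValue() always
            // returns 0, so sample, wait, and sample again.
            var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
            cpu.NextValue();
            Thread.Sleep(1000);
            double usedPercent = cpu.NextValue();

            // Crude capacity proxy: number of logical cores. This ignores clock
            // speed, IPC and turbo, so scale it by a per-machine benchmark score
            // if you need to compare different processors.
            double freeCores = Environment.ProcessorCount * (100.0 - usedPercent) / 100.0;

            Console.WriteLine($"CPU in use: {usedPercent:F1}%");
            Console.WriteLine($"Estimated free capacity: {freeCores:F2} logical cores");
        }
    }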

how much cpu should a single thread application use?

I have a single thread console application.
I am confused by the concept of CPU usage. Should a good single-threaded application use ~100% of the CPU (since it is available), or should it avoid using lots of CPU (since that can cause the computer to slow down)?
I have done some research but haven't found an answer to my confusion. I am a student and still learning so any feedback will be appreciated. Thanks.
It depends on what the program needs the CPU for. If it has to do a lot of work, it's common to use all of one core for some period of time. If it spends most of its time waiting for input, it will naturally tend to use the CPU less frequently. I say "less frequently" instead of "less" because:
Single threaded programs are, at any given time, either running, or they're not, so they are always using either 100% or 0% of one CPU core. Programs that appear to be only using 50% or 30% or whatever are actually just balancing periods of computational work with periods of waiting for input. Devices like hard drives are very slow compared to the CPU, so a program that's reading a lot of data from disk will use less CPU resources than one that crunches lots of numbers.
It's normal for a program to use 100% of the CPU sometimes, often even for a long time, but it's not polite to use it if you don't need it (i.e. busylooping). Such behavior crowds out other programs that could be using the CPU.
The same goes for the hard drive. People forget that the hard drive is a finite resource too, mostly because the task manager doesn't show hard drive usage as a percentage. It's difficult to gauge disk usage as a percentage of the total since disk accesses don't have a fixed speed, unlike the processor. However, it takes much longer to move 1 GB of data on disk than it does for the CPU to move 1 GB of data in memory, and the performance impact of HDD hogging is as bad as or worse than that of CPU hogging (it tends to slow your system to a crawl without showing any CPU usage; you have probably seen it before).
Chances are that any small academic programs you write at first will use all of one core for a short period of time, and then wait. Simple stuff like prompting for a number at the command prompt is the waiting part, and doing whatever academic operation on it afterwards is the active part.
It depends on what it's doing. Different types of operations have different needs.
There is no non-subjective way to answer this question that applies across the board.
The only answer that's true is "it should use only the amount of CPU necessary to do the job, and no more."
In other words, optimize as much as is reasonable. In general, the lower the CPU usage the better: the program will perform faster, be less likely to misbehave, and annoy your users less.
Typically, an algorithmically heavy task such as predicting the weather will have to be managed by the OS, because it will need all of the CPU for as much time as it is allowed to run (until it's done).
On the other hand, a graphical application with a static user interface, like a Windows Forms application for storing a bit of data for record-keeping, should require very low CPU usage, since it is mainly waiting for the user to do something.

how does a c# profiler work?

I'm curious how does a typical C# profiler work?
Are there special hooks in the virtual machine?
Is it easy to scan the byte code for function calls and inject calls to start/stop timer?
Or is it really hard and that's why people pay for tools to do this?
(As a side note, I find this a bit interesting because it's so rare: Google misses the boat completely here; searching for "how does a c# profiler work?" doesn't work at all, the results are all about air conditioners...)
There is a free CLR Profiler by Microsoft, version 4.0.
https://www.microsoft.com/downloads/en/details.aspx?FamilyID=be2d842b-fdce-4600-8d32-a3cf74fda5e1
BTW, there's a nice section in the CLR Profiler doc that describes how it works in detail, starting on page 103. The source is included as part of the distribution.
"Is it easy to scan the byte code for function calls and inject calls to start/stop timer? Or is it really hard and that's why people pay for tools to do this?"
Injecting calls is hard enough that tools are needed to do it.
Not only is it hard, it's a very indirect way to find bottlenecks.
The reason is that a bottleneck is one or a small number of statements in your code that are responsible for a good percentage of the time being spent, time that could be reduced significantly; i.e. it's not truly necessary, i.e. it's wasteful.
IF you can tell the average inclusive time of one of your routines (including IO time), and IF you can multiply it by how many times it has been called, and divide by the total time, you can tell what percent of time the routine takes.
If the percent is small (like 10%) you probably have bigger problems elsewhere.
If the percent is larger (like 20% to 99%) you could have a bottleneck inside the routine.
So now you have to hunt inside the routine for it, looking at things it calls and how much time they take. Also you want to avoid being confused by recursion (the bugaboo of call graphs).
There are profilers (such as Zoom for Linux, Shark, & others) that work on a different principle.
The principle is that there is a function call stack, and during all the time a routine is responsible for (either doing work or waiting for other routines to do work that it requested) it is on the stack.
So if it is responsible for 50% of the time (say), then that's the amount of time it is on the stack,
regardless of how many times it was called, or how much time it took per call.
Not only is the routine on the stack, but the specific lines of code costing the time are also on the stack.
You don't need to hunt for them.
Another thing you don't need is precision of measurement.
If you took 10,000 stack samples, the guilty lines would be measured at 50 +/- 0.5 percent.
If you took 100 samples, they would be measured as 50 +/- 5 percent.
If you took 10 samples, they would be measured as 50 +/- 16 percent.
In every case you find them, and that is your goal.
(And recursion doesn't matter. All it means is that a given line can appear more than once in a given stack sample.)
On this subject, there is lots of confusion. At any rate, the profilers that are most effective for finding bottlenecks are the ones that sample the stack, on wall-clock time, and report percent by line. (This is easy to see if certain myths about profiling are put in perspective.)
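For what it's worth, those error bars are just the binomial standard error of a proportion; a quick check of the figures (assuming the guilty lines are truly on the stack 50% of the time):

    using System;

    class SamplingError
    {
        static void Main()
        {
            const double p = 0.5;   // the guilty lines are on the stack 50% of the time
            foreach (int n in new[] { 10000, 100, 10 })
            {
                // Standard error of a proportion estimated from n samples.
                double err = 100.0 * Math.Sqrt(p * (1 - p) / n);
                Console.WriteLine($"{n,6} samples: 50 +/- {err:F1} percent");
            }
            // Prints roughly 0.5, 5.0 and 15.8 percent, matching the figures above.
        }
    }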
1) There's no such thing as "typical". People collect profile information by a variety of means: time sampling the PC, inspecting stack traces, capturing execution counts of methods/statements/compiled instructions, inserting probes in code to collect counts and optionally calling contexts to get profile data on a call-context basis. Each of these techniques might be implemented in different ways.
2) There's profiling "C#" and profiling "CLR". In the MS world, you could profile CLR and back-translate CLR instruction locations to C# code. I don't know if Mono uses the same CLR instruction set; if it does not, then you could not use the MS CLR profiler; you'd have to use a Mono IL profiler. Or you could instrument the C# source code to collect the profiling data, and then compile/run/collect that data on MS, Mono, somebody's C#-compatible custom compiler, or C# running in embedded systems such as WinCE, where space is precious and features like CLR built-ins tend to get left out.
One way to instrument source code is to use source-to-source transformations, mapping the code from its initial state to code that contains data-collecting probes as well as the original program. This paper on instrumenting code to collect test coverage data shows how a program transformation system can be used to insert test coverage probes, by inserting statements that set block-specific boolean flags when a block of code is executed. A counting profiler substitutes counter-incrementing instructions for those probes. A timing profiler inserts clock-snapshot/delta computations for those probes. Our C# Profiler implements both counting and timing profiling for C# source code this way; it also collects call-graph data by using more sophisticated probes that record the execution path, and so can produce timing data on call graphs. This scheme works anywhere you can get your hands on a halfway decent resolution time value.
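To make the probe idea concrete, here is a hypothetical sketch of what instrumented code might look like after such a transformation. The Profiler class, the probe names and the Fib example are inventions for illustration; they are not the output of any particular tool.

    using System;
    using System.Collections.Concurrent;
    using System.Diagnostics;
    using System.Threading;

    static class Profiler
    {
        class Stats { public long Calls; public long Ticks; }
        static readonly ConcurrentDictionary<string, Stats> Table =
            new ConcurrentDictionary<string, Stats>();

        public static long Enter() => Stopwatch.GetTimestamp();

        public static void Exit(string site, long start)
        {
            var s = Table.GetOrAdd(site, _ => new Stats());
            Interlocked.Increment(ref s.Calls);                              // counting profiler
            Interlocked.Add(ref s.Ticks, Stopwatch.GetTimestamp() - start);  // timing profiler
        }

        public static void Dump()
        {
            foreach (var kv in Table)
                Console.WriteLine("{0}: {1} calls, {2:F1} ms inclusive",
                    kv.Key, kv.Value.Calls,
                    kv.Value.Ticks * 1000.0 / Stopwatch.Frequency);
        }
    }

    class Example
    {
        // What a transformed user method might look like after probe insertion.
        static int Fib(int n)
        {
            long t = Profiler.Enter();
            try { return n < 2 ? n : Fib(n - 1) + Fib(n - 2); }
            finally { Profiler.Exit("Example.Fib", t); }
        }

        static void Main() { Fib(25); Profiler.Dump(); }
    }

Note how the recursive Fib over-counts its own inclusive time, since every nested call's interval is added on top of its parent's; this is exactly the recursion bugaboo mentioned in the earlier answer, and real instrumenting profilers have to detect and compensate for it.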
This is a link to a lengthy article that discusses both instrumentation and sampling methods:
http://smartbear.com/support/articles/aqtime/profiling/

Memory Bandwidth Usage

How do you calculate memory (RAM) bandwidth used? Which performance counters are required?
I came across a tool that was able to do it, the "Rightmark multi-threaded memory test", but unlike the rest of Rightmark's tests, I haven't found the source code for it, just the binaries.
If your code can run on Linux, use Cachegrind:
Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20--100x slower than normal.
You may want to use the KCacheGrind GUI.
It is very difficult to 'calculate' memory bandwidth usage. There are lots of non-trivial cache and MMU issues to contend with. The only real way to do it is either through the use of simulation or real-world measurements.
You can get a 'rough' idea by debugging the code and counting the number of memory load and store operations performed. However, knowing whether it was a cache hit/miss is another issue.
It depends on your purpose. If it is to obtain a guesstimate, you can use the rule of thumb that about 30% of general purpose code is memory loads and stores. If you're trying to get a worst case, you can assume that caches miss all the time and work it out.
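As a back-of-the-envelope illustration of that rule of thumb (every number below is an assumed figure, not a measurement):

    using System;

    class RoughBandwidth
    {
        static void Main()
        {
            double instrPerSec = 2e9;       // assume ~2 billion instructions retired per second
            double memOpFraction = 0.30;    // rule of thumb: ~30% of instructions are loads/stores
            double bytesPerAccess = 8;      // assume 64-bit accesses

            // Worst case: every access goes to DRAM.
            double worstCase = instrPerSec * memOpFraction * bytesPerAccess;
            Console.WriteLine($"Worst case (no cache hits): {worstCase / 1e9:F1} GB/s");

            // More typical: only misses reach DRAM, and each miss fills a 64-byte line.
            double missRate = 0.05;         // assume a 95% cache hit rate
            double typical = instrPerSec * memOpFraction * missRate * 64;
            Console.WriteLine($"With 95% hits: {typical / 1e9:F1} GB/s");
        }
    }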
One potential thing you could do is to look at virtualisation. There are several open source options (QEMU comes to mind). It may be possible to export certain hardware measurements from them.
