Is this a valid way to do performance analysis? I want to get nanosecond accuracy and determine the performance of typecasting:
class PerformanceTest
{
    static double last = 0.0;
    static List<object> numericGenericData = new List<object>();
    static List<double> numericTypedData = new List<double>();

    static void Main(string[] args)
    {
        double totalWithCasting = 0.0;
        double totalWithoutCasting = 0.0;
        for (double d = 0.0; d < 1000000.0; ++d)
        {
            numericGenericData.Add(d);
            numericTypedData.Add(d);
        }

        Stopwatch stopwatch = new Stopwatch();
        for (int i = 0; i < 10; ++i)
        {
            stopwatch.Start();
            testWithTypecasting();
            stopwatch.Stop();
            totalWithCasting += stopwatch.ElapsedTicks;

            stopwatch.Start();
            testWithoutTypeCasting();
            stopwatch.Stop();
            totalWithoutCasting += stopwatch.ElapsedTicks;
        }

        Console.WriteLine("Avg with typecasting = {0}", (totalWithCasting/10));
        Console.WriteLine("Avg without typecasting = {0}", (totalWithoutCasting/10));
        Console.ReadKey();
    }

    static void testWithTypecasting()
    {
        foreach (object o in numericGenericData)
        {
            last = ((double)o*(double)o)/200;
        }
    }

    static void testWithoutTypeCasting()
    {
        foreach (double d in numericTypedData)
        {
            last = (d * d)/200;
        }
    }
}
The output is:
Avg with typecasting = 468872.3
Avg without typecasting = 501157.9
I'm a little suspicious... it looks like there is nearly no impact on the performance. Is casting really that cheap?
Update:
class PerformanceTest
{
    static double last = 0.0;
    static object[] numericGenericData = new object[100000];
    static double[] numericTypedData = new double[100000];
    static Stopwatch stopwatch = new Stopwatch();
    static double totalWithCasting = 0.0;
    static double totalWithoutCasting = 0.0;

    static void Main(string[] args)
    {
        for (int i = 0; i < 100000; ++i)
        {
            numericGenericData[i] = (double)i;
            numericTypedData[i] = (double)i;
        }

        for (int i = 0; i < 10; ++i)
        {
            stopwatch.Start();
            testWithTypecasting();
            stopwatch.Stop();
            totalWithCasting += stopwatch.ElapsedTicks;
            stopwatch.Reset();

            stopwatch.Start();
            testWithoutTypeCasting();
            stopwatch.Stop();
            totalWithoutCasting += stopwatch.ElapsedTicks;
            stopwatch.Reset();
        }

        Console.WriteLine("Avg with typecasting = {0}", (totalWithCasting/(10.0)));
        Console.WriteLine("Avg without typecasting = {0}", (totalWithoutCasting / (10.0)));
        Console.ReadKey();
    }

    static void testWithTypecasting()
    {
        foreach (object o in numericGenericData)
        {
            last = ((double)o * (double)o) / 200;
        }
    }

    static void testWithoutTypeCasting()
    {
        foreach (double d in numericTypedData)
        {
            last = (d * d) / 200;
        }
    }
}
The output is:
Avg with typecasting = 4791
Avg without typecasting = 3303.9
Note that it's not typecasting that you are measuring, it's unboxing. The values are doubles all along, there is no type casting going on.
You forgot to reset the stopwatch between tests, so you are adding the accumulated time of all previous tests over and over. If you convert the ticks to actual time, you see that it adds up to much more than the time it took to run the test.
If you add a stopwatch.Reset(); before each stopwatch.Start();, you get a much more reasonable result like:
Avg with typecasting = 41027,1
Avg without typecasting = 20594,3
Unboxing a value is not so expensive, it only has to check that the data type in the object is correct, then get the value. Still it's a lot more work than when the type is already known. Remember that you are also measuring the looping, calculation and assigning of the result, which is the same for both tests.
Boxing a value is more expensive than unboxing it, as that allocates an object on the heap.
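For reference, a minimal sketch of the corrected timing loop, reusing the names from the question (Restart, available since .NET 4, is shorthand for Reset followed by Start; on older frameworks call Reset() and then Start()):
for (int i = 0; i < 10; ++i)
{
    stopwatch.Restart();                 // zero the timer before each measurement
    testWithTypecasting();
    stopwatch.Stop();
    totalWithCasting += stopwatch.ElapsedTicks;

    stopwatch.Restart();
    testWithoutTypeCasting();
    stopwatch.Stop();
    totalWithoutCasting += stopwatch.ElapsedTicks;
}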
1) Yes, casting is usually (very) cheap.
2) You are not going to get nanosecond accuracy in a managed language. Or in an unmanaged language under most operating systems.
Consider
other processes
garbage collection
different JITters
different CPUs
Also, your measurement includes the foreach loop itself, which looks like 50% or more of the total to me. Maybe 90%.
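One rough way to see how much of the result is just loop overhead is to time a pass over the typed data that does almost no work and subtract it from both measurements. A sketch, where baselineLoop is a hypothetical helper added here, not part of the original code:
// Hypothetical baseline: roughly the cost of iterating without the cast/divide.
// The accumulation keeps the JIT from optimizing the loop away.
static double baselineLoop()
{
    double sum = 0.0;
    foreach (double d in numericTypedData)
    {
        sum += d;
    }
    return sum;
}

// Usage: time baselineLoop() the same way as the two tests and subtract
// its ticks from each result to estimate the cost of the actual work.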
When you call Stopwatch.Start, the timer continues running from wherever it left off. You need to call Stopwatch.Reset() to set the timer back to zero before starting again. Personally, I just use stopwatch = Stopwatch.StartNew() whenever I want to start a timer to avoid this sort of confusion.
Furthermore, you probably want to call both of your test methods before starting the "timing loop" so that they get a fair chance to "warm up", ensuring the JIT has had a chance to run and the playing field is even.
When I do that on my machine, I see that testWithTypecasting takes approximately twice as long as testWithoutTypeCasting.
That being said, however, the cast itself is not likely to be the most significant part of that performance penalty. The testWithTypecasting method operates on a list of boxed doubles, which means an additional level of indirection is required to retrieve each value (following a reference to the value somewhere else in memory), in addition to increasing the total amount of memory consumed. This increases the time spent on memory access and is likely to be a bigger effect than the CPU time spent "in the cast" itself.
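A minimal sketch of that pattern, reusing the question's methods and totals (warm-up calls first, then a fresh Stopwatch.StartNew() per measurement):
// Warm up both methods once so the JIT has compiled them before timing.
testWithTypecasting();
testWithoutTypeCasting();

for (int i = 0; i < 10; ++i)
{
    var sw = Stopwatch.StartNew();   // fresh stopwatch, no stale elapsed time
    testWithTypecasting();
    sw.Stop();
    totalWithCasting += sw.ElapsedTicks;

    sw = Stopwatch.StartNew();
    testWithoutTypeCasting();
    sw.Stop();
    totalWithoutCasting += sw.ElapsedTicks;
}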
Look into performance counters in the System.Diagnostics namespace. When you create a new counter, you first create a category and then specify one or more counters to be placed in it.
// Create a collection of type CounterCreationDataCollection.
System.Diagnostics.CounterCreationDataCollection CounterDatas =
    new System.Diagnostics.CounterCreationDataCollection();

// Create the counters and set their properties.
System.Diagnostics.CounterCreationData cdCounter1 =
    new System.Diagnostics.CounterCreationData();
System.Diagnostics.CounterCreationData cdCounter2 =
    new System.Diagnostics.CounterCreationData();
cdCounter1.CounterName = "Counter1";
cdCounter1.CounterHelp = "help string1";
cdCounter1.CounterType = System.Diagnostics.PerformanceCounterType.NumberOfItems64;
cdCounter2.CounterName = "Counter2";
cdCounter2.CounterHelp = "help string 2";
cdCounter2.CounterType = System.Diagnostics.PerformanceCounterType.NumberOfItems64;

// Add both counters to the collection.
CounterDatas.Add(cdCounter1);
CounterDatas.Add(cdCounter2);

// Create the category and pass the collection to it.
System.Diagnostics.PerformanceCounterCategory.Create(
    "Multi Counter Category", "Category help", CounterDatas);
see MSDN docs
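Once the category exists, the counters can be opened in writable mode, incremented, and read back. A minimal sketch, using the names from the snippet above (this assumes the category has already been created and the process has sufficient permissions):
// Open Counter1 for writing and bump it.
using (var counter = new System.Diagnostics.PerformanceCounter(
    "Multi Counter Category", "Counter1", false))   // false = writable
{
    counter.Increment();       // add 1
    counter.IncrementBy(10);   // add 10
    Console.WriteLine(counter.RawValue);
}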
Just a thought, but sometimes identical machine code can take a different number of cycles to execute depending on its alignment in memory, so you might want to add a control or controls.
I don't "do" C# myself, but in C for x86-32 and later, the rdtsc instruction is usually available, and it is much more accurate than OS ticks. More info on rdtsc can be found by searching Stack Overflow. Under C it is usually available as an intrinsic or built-in function and returns the number of clock cycles (as an 8-byte unsigned integer, long long/__int64) since the computer was powered up. So if the CPU has a clock speed of 3 GHz, the underlying counter is incremented 3 billion times per second. Save for a few early AMD processors, all multi-core CPUs will have their counters synchronized.
If C# does not expose it, you might consider writing a VERY short C function to access it from C#. There is a great deal of overhead if you access the instruction through a function vs inline. The difference between two back-to-back calls to the function will be the basic measurement overhead. If you're thinking of metering your application, you'll have to determine several more complex overhead values.
You might consider shutting off the CPU's energy-saving mode (and restarting the PC), as it lowers the clock frequency fed to the CPU during periods of low activity, which causes the time stamp counters of the different cores to become unsynchronized.
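If you want to stay in managed code, Stopwatch.GetTimestamp() (backed by QueryPerformanceCounter on Windows) is the closest built-in equivalent, and the back-to-back-calls trick described above works the same way. A sketch:
using System;
using System.Diagnostics;

class TimestampOverhead
{
    static void Main()
    {
        // Two back-to-back reads give a rough lower bound on the
        // measurement overhead itself.
        long t1 = Stopwatch.GetTimestamp();
        long t2 = Stopwatch.GetTimestamp();

        double nsPerTick = 1e9 / Stopwatch.Frequency;
        Console.WriteLine("High resolution: {0}", Stopwatch.IsHighResolution);
        Console.WriteLine("Overhead: ~{0:f0} ns", (t2 - t1) * nsPerTick);
    }
}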
Related
I am running the Hello World example of hybridizer-basic-samples, but the execution time is higher on the GPU than on the CPU.
[EntryPoint("run")]
public static void Run(int N, double[] a, double[] b)
{
    Parallel.For(0, N, i => { a[i] += b[i]; });
}

static void Main(string[] args)
{
    int N = 1024 * 1024 * 16;
    double[] acuda = new double[N];
    double[] adotnet = new double[N];
    double[] b = new double[N];
    Random rand = new Random();
    for (int i = 0; i < N; ++i)
    {
        acuda[i] = rand.NextDouble();
        adotnet[i] = acuda[i];
        b[i] = rand.NextDouble();
    }

    cudaDeviceProp prop;
    cuda.GetDeviceProperties(out prop, 0);
    HybRunner runner = HybRunner.Cuda().SetDistrib(prop.multiProcessorCount * 16, 128);
    dynamic wrapped = runner.Wrap(new Program());

    // run the method on GPU
    var watch = System.Diagnostics.Stopwatch.StartNew();
    wrapped.Run(N, acuda, b);
    watch.Stop();
    Console.WriteLine($"Execution Time: {watch.ElapsedMilliseconds} ms");

    // run .Net method
    var watch2 = System.Diagnostics.Stopwatch.StartNew();
    Run(N, adotnet, b);
    watch2.Stop();
    Console.WriteLine($"Execution Time: {watch2.ElapsedMilliseconds} ms");
}
When I run the program, the execution time of Run() on the GPU is always higher than for the .NET method. For example, the GPU execution took 818 ms while the CPU took 89 ms. Can anyone please explain the reason?
As mentioned by @InBetween, it is likely that you are measuring compiler overhead. It is good practice to do a warm-up pass to let all code compile first, or use something like BenchmarkDotNet that does that for you.
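For example, a hedged sketch of a warm-up pass inside the question's Main, before the timed calls (warmA is a scratch buffer introduced only for this sketch, so the warm-up does not change acuda or adotnet):
// Warm-up: run both versions once on a throw-away buffer so JIT compilation
// and the first CUDA module/launch setup are not included in the timed runs.
double[] warmA = new double[N];
Array.Copy(acuda, warmA, N);
wrapped.Run(N, warmA, b);   // first GPU call pays the setup cost
Run(N, warmA, b);           // first CPU call pays the JIT cost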
Another possible reason is overhead. When running things on a GPU the system would need to copy the input data to GPU memory, and copy the result back again. There will probably also be other costs involved. Adding numbers together is a very simple operation, so it is likely the processor can run at max theoretical speed.
Let's do some back-of-the-envelope calculations. Assume the CPU can do 4 adds per clock (i.e., what AVX-256 can do). 4 * 8 bytes per double = 32 bytes per clock, at 4*10^9 clocks per second. This gives 128 GB/s in processing speed, which is significantly higher than the PCIe 3 x16 bandwidth of 16 GB/s. You will probably not reach this speed due to other limitations, but it shows that the limiting factor is probably not the processor itself, so using a GPU will probably not improve things.
GPU processing should show better gains when using more complicated algorithms that do more processing for each data-item.
Story
So, I wanted to create a small cross-platform game, but then I ended up targeting devices that don't support JIT, such as the iPhone, Windows Mobile, and Xbox One (game side, not application side).
Since the game had to generate some "basic" code out of text files with scripts in them (formulas, assignments, function calls, modifying/storing values in a dictionary per object, sort of like a hybrid interactive fiction game), it wasn't really possible to do with AOT compilation.
After some thinking, I came up with a way around it: store collections of functions and whatnot to "emulate" normal code. If this approach turned out more than twice as slow as compiled code, I would consider dropping the devices that can't run JIT-compiled code.
I was expecting the code compiled in Visual Studio to be the fastest, and the Linq.Expressions version to be at most about 10% slower.
The hack of storing functions and calling them for almost everything I expected to be quite a lot slower than compiled code, but...
To my surprise, it is faster???
Note:
This project is primarily about learning and personal interests in my free time.
The end product is just a bonus, being able to sell or make it open source.
Testing
Here is a test example of what I'm doing, "trying" to model how the code would be used: there are multiple "scripts" with different functions and parameters that operate on the TestObject.
Interesting parts of the code are:
The constructor of the classes that derive from PerfTest.
The Perform(TestObject obj) functions that they override.
This was compiled with Visual Studio 2017
.Net Framework 4.7.2
In release mode.
Optimizations turned on.
Platform target = x86 (haven't tested on ARM yet)
Tested the program from Visual Studio and standalone; it didn't make any noticeable difference in performance.
Console Test Program
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq.Expressions;
namespace Test
{
class Program
{
static void Main(string[] args)
{
new PerformanceTest();
Console.WriteLine();
Console.WriteLine("Done, press enter to exit");
Console.ReadLine();
}
}
class TestObject
{
public Dictionary<string, float> data = new Dictionary<string, float>();
public TestObject(Random rnd)
{
data.Add("A", (float)rnd.NextDouble());
data.Add("B", (float)rnd.NextDouble());
data.Add("C", (float)rnd.NextDouble());
data.Add("D", (float)rnd.NextDouble() + 1.0f);
data.Add("E", (float)rnd.NextDouble());
data.Add("F", (float)rnd.NextDouble() + 1.0f);
}
}
class PerformanceTest
{
Stopwatch timer = new Stopwatch();
public PerformanceTest()
{
var rnd = new Random(1);
int testSize = 5000000;
int testTimes = 5;
Console.WriteLine($"Creating {testSize} objects to test performance with");
timer.Start();
var data = new TestObject[testSize];
for (int i = 0; i < data.Length; i++)
data[i] = new TestObject(rnd);
Console.WriteLine($"Created objects in {timer.ElapsedMilliseconds} milliseconds");
int handlers = 1000;
Console.WriteLine($"Creating {handlers} handlers per type");
var tests = new PerfTest[3][];
tests[0] = new PerfTest[handlers];
tests[1] = new PerfTest[handlers];
tests[2] = new PerfTest[handlers];
for (int i = 0; i < tests[0].Length; i++)
tests[0][i] = new TestNormal();
for (int i = 0; i < tests[1].Length; i++)
tests[1][i] = new TestExpression();
for (int i = 0; i < tests[2].Length; i++)
tests[2][i] = new TestOther();
Console.WriteLine($"Handlers created");
Console.WriteLine($"Warming up all handlers");
for (int t = 0; t < tests.Length; t++)
for (int i = 0; i < tests[t].Length; i++)
tests[t][i].Perform(data[0]);
Console.WriteLine($"Testing data {testTimes} times with handlers of each type");
for (int i = 0; i < testTimes; i++)
{
Console.WriteLine();
for (int t = 0; t < tests.Length; t++)
Loop(tests[t], data);
}
timer.Stop();
}
void Loop(PerfTest[] test, TestObject[] data)
{
var rnd = new Random(1);
var start = timer.ElapsedMilliseconds;
double sum = 0;
for (int i = 0; i < data.Length; i++)
sum += test[rnd.Next(test.Length)].Perform(data[i]);
var stop = timer.ElapsedMilliseconds;
var elapsed = stop - start;
Console.WriteLine($"{test[0].Name}".PadRight(25) + $"{elapsed} milliseconds".PadRight(20) + $"sum = { sum}");
}
}
abstract class PerfTest
{
public string Name;
public abstract float Perform(TestObject obj);
}
class TestNormal : PerfTest
{
public TestNormal()
{
Name = "\"Normal\"";
}
public override float Perform(TestObject obj) => obj.data["A"] * obj.data["B"] + obj.data["C"] / obj.data["D"] + obj.data["E"] / (obj.data["E"] + obj.data["F"]);
}
class TestExpression : PerfTest
{
Func<TestObject, float> compiledExpression;
public TestExpression()
{
Name = "Compiled Expression";
var par = Expression.Parameter(typeof(TestObject));
var body = Expression.Add(Expression.Multiply(indexer(par, "A"), indexer(par, "B")), Expression.Add(Expression.Divide(indexer(par, "C"), indexer(par, "D")), Expression.Divide(indexer(par, "E"), Expression.Add(indexer(par, "E"), indexer(par, "F")))));
var lambda = Expression.Lambda<Func<TestObject, float>>(body, par);
compiledExpression = lambda.Compile();
}
static Expression indexer(Expression parameter, string index)
{
var property = Expression.Field(parameter, typeof(TestObject).GetField("data"));
return Expression.MakeIndex(property, typeof(Dictionary<string, float>).GetProperty("Item"), new[] { Expression.Constant(index) });
}
public override float Perform(TestObject obj) => compiledExpression(obj);
}
class TestOther : PerfTest
{
Func<TestObject, float>[] parameters;
Func<float, float, float, float, float, float, float> func;
public TestOther()
{
Name = "other";
Func<float, float, float, float, float, float, float> func = (a, b, c, d, e, f) => a * b + c / d + e / (e + f);
this.func = func; // this delegate will come from a collection of functions, depending on type
parameters = new Func<TestObject, float>[]
{
(o) => o.data["A"],
(o) => o.data["B"],
(o) => o.data["C"],
(o) => o.data["D"],
(o) => o.data["E"],
(o) => o.data["F"],
};
}
float call(TestObject obj, Func<float, float, float, float, float, float, float> myfunc, Func<TestObject, float>[] parameters)
{
return myfunc(parameters[0](obj), parameters[1](obj), parameters[2](obj), parameters[3](obj), parameters[4](obj), parameters[5](obj));
}
public override float Perform(TestObject obj) => call(obj, func, parameters);
}
}
Output result of this Console test:
Creating 5000000 objects to test performance with
Created objects in 7489 milliseconds
Creating 1000 handlers per type
Handlers created
Warming up all handlers
Testing data 5 times with handlers of each type
"Normal" 811 milliseconds sum = 4174863.85436047
Compiled Expression 1371 milliseconds sum = 4174863.85436047
other 746 milliseconds sum = 4174863.85436047
"Normal" 812 milliseconds sum = 4174863.85436047
Compiled Expression 1379 milliseconds sum = 4174863.85436047
other 747 milliseconds sum = 4174863.85436047
"Normal" 812 milliseconds sum = 4174863.85436047
Compiled Expression 1373 milliseconds sum = 4174863.85436047
other 747 milliseconds sum = 4174863.85436047
"Normal" 812 milliseconds sum = 4174863.85436047
Compiled Expression 1373 milliseconds sum = 4174863.85436047
other 747 milliseconds sum = 4174863.85436047
"Normal" 812 milliseconds sum = 4174863.85436047
Compiled Expression 1375 milliseconds sum = 4174863.85436047
other 746 milliseconds sum = 4174863.85436047
Done, press enter to exit
Question
Why is the class TestOther's Perform function faster than both
TestNormal and TestExpression?
And I expected the TestExpression to be closer to the TestNormal, why is it so far off?
When in doubt put the code into a profiler. I have looked at it and found that the main difference between the two fast ones and the slow compiled Expression was the dictionary lookup performance.
The Expression version needs more than twice as much CPU in Dictionary FindEntry compared to the others.
Stack Weight (in view) (ms)
GameTest.exe!Test.PerformanceTest::Loop 15,243.896600
|- Anonymously Hosted DynamicMethods Assembly!dynamicClass::lambda_method 6,038.952700
|- GameTest.exe!Test.TestNormal::Perform 3,724.253300
|- GameTest.exe!Test.TestOther::call 3,493.239800
Then I checked the generated assembly code. It looked nearly identical and cannot explain the vast margin by which the expression version loses.
I also broke into WinDbg to check whether different things were being passed to the Dictionary[x] call, but everything looked normal.
To sum it up: all of your versions do essentially the same amount of work (minus the double "E" lookup of the dictionary version, but that plays no role in our factor of two), yet the Expression version needs twice as much CPU. That is really a mystery.
Your benchmark code calls a random test class instance on each run. I replaced that random walk by always taking the first instance instead of a random one:
for (int i = 0; i < data.Length; i++)
// sum += test[rnd.Next(test.Length)].Perform(data[i]);
sum += test[0].Perform(data[i]);
and now I get much better values:
Compiled Expression 740 milliseconds sum = 4174863.85440933
"Normal" 743 milliseconds sum = 4174863.85430179
other 714 milliseconds sum = 4174863.85430179
The problem with your code was/is that, due to the many indirections, you end up one indirection too far, and the branch predictor of the CPU is no longer able to predict the next call target of the compiled expression, which involves two hops. When I use the random walk, I get back the "bad" performance:
Compiled Expression 1359 milliseconds sum = 4174863.85440933
"Normal" 775 milliseconds sum = 4174863.85430179
other 771 milliseconds sum = 4174863.85430179
The observed bad behavior is highly CPU dependent and related to the CPU's code and data cache sizes. I do not have VTune at hand to back that up with numbers, but this once again shows that today's CPUs are tricky beasts.
I ran my code on a Core(TM) i7-4770K CPU @ 3.50GHz.
Dictionaries are known to be very bad for cache prefetching because they tend to jump around wildly in memory with no discernible pattern. The many dictionary calls already seem to confuse the prefetcher quite a bit, and the additional randomness of the test instance used, plus the more complex dispatch of the compiled expression, was too much for the CPU to predict the memory access pattern and prefetch parts of it into the L1/L2 caches. In effect, you were not testing call performance but how well the CPU's caching strategies were performing.
You should refactor your test code to use a simpler call pattern and perhaps use BenchmarkDotNet to factor these things out. That gives results which are in line with your expectations:
         Method |    N |     Mean |
--------------- |----- |---------:|
     TestNormal | 1000 | 3.175 us |
 TestExpression | 1000 | 3.480 us |
      TestOther | 1000 | 4.325 us |
The direct call is fastest, next comes the expression, and last the delegate approach. But that was a micro-benchmark. Your actual performance numbers can be different, as you found at the beginning, and even counter-intuitive.
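For reference, a rough BenchmarkDotNet skeleton for this kind of comparison, assuming the question's TestObject and PerfTest classes are in scope; exact numbers will of course differ from the table above:
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class PerformBenchmarks
{
    [Params(1000)]
    public int N;

    private TestObject[] data;
    private PerfTest normal, expression, other;

    [GlobalSetup]
    public void Setup()
    {
        var rnd = new Random(1);
        data = new TestObject[N];
        for (int i = 0; i < N; i++)
            data[i] = new TestObject(rnd);

        normal = new TestNormal();
        expression = new TestExpression();
        other = new TestOther();
    }

    // One benchmark per dispatch strategy; each runs the same simple call pattern.
    [Benchmark] public float Normal() { return Sum(normal); }
    [Benchmark] public float CompiledExpression() { return Sum(expression); }
    [Benchmark] public float Other() { return Sum(other); }

    private float Sum(PerfTest test)
    {
        float sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += test.Perform(data[i]);
        return sum;
    }
}

// Entry point: BenchmarkRunner.Run<PerformBenchmarks>();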
Your "normal" implementation
public override float Perform(TestObject obj)
{
return obj.data["A"] * obj.data["B"]
+ obj.data["C"] / obj.data["D"]
+ obj.data["E"] / (obj.data["E"] + obj.data["F"]);
}
is a bit inefficient. It calls obj.data["E"] twice, while the "other" implementation calls it only once. If you alter the code a bit
public override float Perform(TestObject obj)
{
var e = obj.data["E"];
return obj.data["A"] * obj.data["B"]
+ obj.data["C"] / obj.data["D"]
+ e / (e + obj.data["F"]);
}
it would perform as expected, slightly faster than the "other".
Even though it is good to analyze the performance of code in terms of algorithmic analysis and Big-O notation, I wanted to see how long the code takes to execute on my PC. I initialized a List with 9999 elements and removed the even elements from it. Sadly, the time span to execute this appears to be 0:0:0. Surprised by the result, I assume there must be something wrong in the way I time the execution. Could someone help me time the code correctly?
IList<int> source = new List<int>(100);
for (int i = 0; i < 9999; i++)
{
    source.Add(i);
}

TimeSpan startTime, duration;
startTime = Process.GetCurrentProcess().Threads[0].UserProcessorTime;
RemoveEven(ref source);
duration = Process.GetCurrentProcess().Threads[0].UserProcessorTime.Subtract(startTime);
Console.WriteLine(duration.Milliseconds);
Console.Read();
The most appropriate thing to use there would be Stopwatch - anything involving TimeSpan has nowhere near enough precision for this:
var watch = Stopwatch.StartNew();
// something to time
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
However, a modern CPU is very fast, and it would not surprise me if it can remove them in that time. Normally, for timing, you need to repeat an operation a large number of times to get a reasonable measurement.
Aside: the ref in RemoveEven(ref source) is almost certainly not needed.
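A sketch of that repetition idea, reusing the question's RemoveEven; note that rebuilding the list each iteration is included in the measurement here, which is acceptable for a rough figure:
// Repeat the operation many times so the measured work is large
// compared to the timer's resolution, then report the average.
const int iterations = 1000;
var watch = Stopwatch.StartNew();
for (int n = 0; n < iterations; n++)
{
    IList<int> source = new List<int>(10000);
    for (int i = 0; i < 9999; i++)
        source.Add(i);
    RemoveEven(ref source);
}
watch.Stop();
Console.WriteLine("Average: {0:f4} ms",
    watch.Elapsed.TotalMilliseconds / iterations);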
In .Net 2.0 you can use the Stopwatch class
IList<int> source = new List<int>(100);
for (int i = 0; i < 9999; i++)
{
    source.Add(i);
}

Stopwatch watch = new Stopwatch();
watch.Start();
RemoveEven(ref source);
watch.Stop();
// watch.ElapsedMilliseconds contains the execution time in ms
Adding to previous answers:
var sw = Stopwatch.StartNew();
// instructions to time
sw.Stop();
sw.ElapsedMilliseconds returns a long and has a resolution of:
1 millisecond = 1000000 nanoseconds
sw.Elapsed.TotalMilliseconds returns a double and has a resolution equal to the inverse of Stopwatch.Frequency. On my PC for example Stopwatch.Frequency has a value of 2939541 ticks per second, that gives sw.Elapsed.TotalMilliseconds a resolution of:
1/2939541 seconds = 3,401891655874165e-7 seconds = 340 nanoseconds
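Those figures are machine-specific; they can be reproduced with a few lines:
// Print this machine's Stopwatch characteristics and per-tick resolution.
Console.WriteLine("IsHighResolution: {0}", Stopwatch.IsHighResolution);
Console.WriteLine("Frequency: {0} ticks/s", Stopwatch.Frequency);
Console.WriteLine("Resolution: {0:e3} s ({1:f0} ns) per tick",
    1.0 / Stopwatch.Frequency, 1e9 / Stopwatch.Frequency);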
How can I get CPU Load per core (quadcore cpu), in C#?
Thanks :)
You can use either WMI or the System.Diagnostics namespace. From there you can grab any of the performance counters you wish (however, it takes a second (1-1.5 s) to initialize them; reading values is fine, only the initialization is slow).
The code can then look like this:
using System.Diagnostics;

public static Double Calculate(CounterSample oldSample, CounterSample newSample)
{
    double difference = newSample.RawValue - oldSample.RawValue;
    double timeInterval = newSample.TimeStamp100nSec - oldSample.TimeStamp100nSec;
    if (timeInterval != 0) return 100 * (1 - (difference / timeInterval));
    return 0;
}

static void Main()
{
    var pc = new PerformanceCounter("Processor Information", "% Processor Time");
    var cat = new PerformanceCounterCategory("Processor Information");
    var instances = cat.GetInstanceNames();
    var cs = new Dictionary<string, CounterSample>();

    foreach (var s in instances)
    {
        pc.InstanceName = s;
        cs.Add(s, pc.NextSample());
    }

    while (true)
    {
        foreach (var s in instances)
        {
            pc.InstanceName = s;
            Console.WriteLine("{0} - {1:f}", s, Calculate(cs[s], pc.NextSample()));
            cs[s] = pc.NextSample();
        }
        System.Threading.Thread.Sleep(500);
    }
}
The important thing is that you can't rely on the native .NET calculation for 100nsInverse performance counters (it returns only 0 or 100 for me... a bug?); you have to calculate it yourself, and for that you need an archive of the last CounterSample for each instance (instances represent a core or the sum of the cores).
There appears to be a naming convention for those instances:
0,0 - first cpu first core
0,1 - first cpu second core
0,_Total - total load of first cpu
_Total - total load of all cpus
(not verified - I would not recommend relying on it until further investigation is done)...
Since cores show up as separate CPUs to the OS, you use the same code you'd use to determine the load per CPU in a multiprocessor machine. One such example (in C) is here. Note that it uses WMI, so the other thread linked in the comments above probably gets you most of the way there.
First you need as many performance counters as cores (this takes time, so you might want to do it in an async task, or it will block your UI):
CoreCount = Environment.ProcessorCount;
CPUCounters = new PerformanceCounter[CoreCount];
for (var i = 0; i < CoreCount; i++)
    CPUCounters[i] = new PerformanceCounter("Processor", "% Processor Time", $"{i}");
Then you can fetch the value of usage of each core:
public float[] CoreUsage { get; set; } // feel free to implement inpc
public void Update() => CoreUsage = CPUCounters.Select(o => o.NextValue()).ToArray();
However, you might want to call Update() three times: the first time you'll always get 0.0, the second time always 0.0 or 100.0, and only then the actual value.
NextValue() gives you the instant value, not the mean value over time; if you want the mean value say over 1 second, you can use RawValue and calculate the mean as explained here: PerformanceCounter.RawValue Property
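If an average over a fixed interval is enough, a simpler alternative to the RawValue approach is to space out the two NextValue() calls yourself; a sketch reusing the CPUCounters array from above (the method name is made up here):
// "% Processor Time" is computed over the interval between two samples,
// so sampling, sleeping, and sampling again yields the mean load for
// that interval on each core.
public float[] SampleCoreUsage(TimeSpan interval)
{
    foreach (var c in CPUCounters)
        c.NextValue();                    // first call only primes the counter
    System.Threading.Thread.Sleep(interval);
    return CPUCounters.Select(c => c.NextValue()).ToArray();
}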
float f = 5.13f;
double d = 5.13;

float fp = f - (float)Math.Floor(f);
double dp = d - Math.Floor(d);
Isn't there any faster way than calling an external function every time?
"External function"?
System.Math is built into mscorlib!
This is actually the fastest way to do this.
You could cast f to an int which would trim the fractional part. This presumes that your doubles fall within the range of an integer.
Of course, the JITter may be smart enough to optimize Math.Floor into some inlined asm that does the floor, which may be faster than the cast to int and back to float.
Have you actually measured and verified that the performance of Math.floor is affecting your program? If you haven't, you shouldn't bother with this level of micro-optimization until you know that is a problem, and then measure the performance of this alternative against the original code.
EDIT: This does appear faster. The following code takes 717 ms when using Math.Floor(), and 172 ms for the int-casting code on my machine, in release mode. But again, I doubt the perf improvement really matters: to get this to be measurable I had to do 100M iterations. Also, I find Math.Floor() to be much more readable and obvious in its intent, and a future CLR could emit more optimal code for Math.Floor and beat this approach easily.
private static double Floor1Test()
{
    // Keep track of results in total so ops aren't optimized away.
    double total = 0;
    for (int i = 0; i < 100000000; i++)
    {
        float f = 5.13f;
        double d = 5.13;

        float fp = f - (float)Math.Floor(f);
        double dp = d - (float)Math.Floor(d);

        total = fp + dp;
    }
    return total;
}

private static double Floor2Test()
{
    // Keep track of total so ops aren't optimized away.
    double total = 0;
    for (int i = 0; i < 100000000; i++)
    {
        float f = 5.13f;
        double d = 5.13;

        float fp = f - (int)(f);
        double dp = d - (int)(d);

        total = fp + dp;
    }
    return total;
}

static void Main(string[] args)
{
    System.Diagnostics.Stopwatch timer = new System.Diagnostics.Stopwatch();

    // Unused run first, guarantee code is JIT'd.
    timer.Start();
    Floor1Test();
    Floor2Test();
    timer.Stop();
    timer.Reset();

    timer.Start();
    Floor1Test();
    timer.Stop();
    long floor1time = timer.ElapsedMilliseconds;

    timer.Reset();
    timer.Start();
    Floor2Test();
    timer.Stop();
    long floor2time = timer.ElapsedMilliseconds;

    Console.WriteLine("Floor 1 - {0} ms", floor1time);
    Console.WriteLine("Floor 2 - {0} ms", floor2time);
}
Donald E. Knuth said:
"We should forget about small efficiencies, say about 97% of the time: premature
optimization is the root of all evil."
So unless you have benchmarked your application and found positive evidence that this operation is the bottleneck, don't bother optimizing this line of code.
Well, I doubt you'll get any real world performance gain, but according to Reflector Math.Floor is this:
public static decimal Floor(decimal d)
{
    return decimal.Floor(d);
}
So arguably
double dp = d - decimal.Floor(d);
may be quicker. (Compiler optimisations make the whole point moot I know...)
For those who may be interested to take this to its logical conclusion decimal.Floor is:
public static decimal Floor(decimal d)
{
    decimal result = 0M;
    FCallFloor(ref result, d);
    return result;
}
with FCallFloor being a call into unmanaged code, so you are pretty much at the limit of the "optimisation" there.
In the case of Decimal, I would recommend ignoring everyone yelling not to change it and try using Decimal.Truncate. Whether it is faster or not, it is a function specifically intended for what you are trying to do and thus is a bit clearer.
Oh, and by the way, it is faster:
System.Diagnostics.Stopwatch foo = new System.Diagnostics.Stopwatch();
Decimal x = 1.5M;
Decimal y = 1;
int tests = 1000000;

foo.Start();
for (int z = 0; z < tests; ++z)
{
    y = x - Decimal.Truncate(x);
}
foo.Stop();
Console.WriteLine(foo.ElapsedMilliseconds);

foo.Reset();
foo.Start();
for (int z = 0; z < tests; ++z)
{
    y = x - Math.Floor(x);
}
foo.Stop();
Console.WriteLine(foo.ElapsedMilliseconds);
Console.ReadKey();

//Output: 123
//Output: 164
Edit: Fixed my explanation and code.
It is static, so this should be really fast; there is no object to stand up. You can always do bit-level math, but unless you have a serious need, just use the function. The Floor() method is likely already doing this, but you could inline it and cut out checks etc. if you need something really fast; in C#, though, this is not your greatest performance issue.
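If you do go down that route, here is a hedged sketch of an inlined, cast-based fractional-part helper; note it is only valid for non-negative values that fit in an int, because a cast truncates toward zero, which differs from Floor for negative inputs (AggressiveInlining requires .NET 4.5+):
using System.Runtime.CompilerServices;

static class FastFrac
{
    // Sketch only: correct for 0 <= d < int.MaxValue.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double FractionalPart(double d)
    {
        return d - (int)d;   // truncation, not a true floor for negatives
    }
}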