Using the Halcon 13 function FindNccModel in C# causes the following error:
HALCON error #6001: Not enough memory available in operator find_ncc_model
class Program
{
    static void Main(string[] args)
    {
        HImage Image = new HImage(@"08_09_09_41_33_582_OK_000000153000.png");
        double MidpointRow = 1053.5210373923057, MidpointCol = 1223.5205413999142;
        int iCounter = 0;
        while (true)
        {
            HNCCModel model = new HNCCModel(@"000000135000Mark_0.ncm");
            HXLDCont hxCont = new HXLDCont();
            hxCont.GenRectangle2ContourXld(
                721.9213759213759,
                1775.862648221344,
                -0.99483767363676778,
                72,
                14.5);
            HTuple htRowXLD, htColXLD;
            hxCont.GetContourXld(out htRowXLD, out htColXLD);
            HTuple htRadius = new HTuple();
            htRadius = new HTuple(htRowXLD.TupleSub(MidpointRow).TuplePow(2) + htColXLD.TupleSub(MidpointCol).TuplePow(2)).TupleSqrt();
            HRegion hrAnnulus = new HRegion();
            hrAnnulus = hrAnnulus.GenAnnulus(MidpointRow, MidpointCol, htRadius.TupleMin() - 5.0, htRadius.TupleMax() + 5.0);
            HImage hiTemp = Image.Clone();
            HImage hiTemp2 = hiTemp.Rgb1ToGray();
            HImage hiTemp3 = hiTemp2.ReduceDomain(hrAnnulus);
            HTuple htRow, htColumn, Angle, Score;
            model.FindNccModel(hiTemp3, -0.39, 6.29, 0.65, 1, 0, "true", 0, out htRow, out htColumn, out Angle, out Score);
            hxCont.DisposeIfNotNull();
            hrAnnulus.DisposeIfNotNull();
            model.Dispose();
            hiTemp.DisposeIfNotNull();
            hiTemp2.DisposeIfNotNull();
            hiTemp3.DisposeIfNotNull();
            Console.WriteLine(iCounter++.ToString());
        }
    }
}
public static class DL_HalconUtilityClass
{
    public static HRegion GenAnnulus(this HRegion region, double dCenterRow, double dCenterColumn, double dRadiusSmall, double dRadiusBig)
    {
        region.GenEmptyRegion();
        if (dRadiusSmall > dRadiusBig)
        {
            throw new NotSupportedException("Wrong input parameters. Small radius is bigger than big radius.");
        }
        HRegion hrCircleSmall = new HRegion(dCenterRow, dCenterColumn, dRadiusSmall);
        HRegion hrCircleBig = new HRegion(dCenterRow, dCenterColumn, dRadiusBig);
        region = new HRegion();
        region = hrCircleBig.Difference(hrCircleSmall);
        hrCircleSmall.Dispose();
        hrCircleBig.Dispose();
        return region;
    }

    public static void DisposeIfNotNull(this HImage hiImage)
    {
        if (hiImage != null) hiImage.Dispose();
    }

    public static void DisposeIfNotNull(this HRegion hrRegion)
    {
        if (hrRegion != null) hrRegion.Dispose();
    }

    public static void DisposeIfNotNull(this HObject hoObject)
    {
        if (hoObject != null) hoObject.Dispose();
    }
}
The function itself can run endlessly in a while loop, but when it is combined with our program it causes a memory exception. On the other hand, the program itself can run endlessly without this function. It is also interesting that the error occurs before the program reaches its typical 1.1 GB of memory, which suggests there is a memory leak.
I didn't find any references to this problem in the Halcon documentation, and upgrading to the newest Halcon 13 version or using Halcon XL did not help. Does anyone know what could cause this problem?
In your code you already manually dispose of most HALCON objects, as is suggested. As you probably know, this is necessary because the .NET garbage collector does not know about the amount of unmanaged memory that the HALCON library allocates behind a managed object.
However, you forget to dispose of the HTuples that contain the results of FindNccModel: htRow, htColumn, Angle and Score.
You might also want to move the creation of the HNCCModel out of your while loop.
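A minimal sketch of those two changes, assuming the rest of the loop body stays as in the question (the Dispose calls on the result tuples are the essential part):

HNCCModel model = new HNCCModel(@"000000135000Mark_0.ncm"); // create the model once, outside the loop
while (true)
{
    // ... build hiTemp3 exactly as in the original code ...

    HTuple htRow, htColumn, Angle, Score;
    model.FindNccModel(hiTemp3, -0.39, 6.29, 0.65, 1, 0, "true", 0,
                       out htRow, out htColumn, out Angle, out Score);

    // Dispose the result tuples as well - they hold unmanaged HALCON memory
    htRow.Dispose();
    htColumn.Dispose();
    Angle.Dispose();
    Score.Dispose();

    // ... dispose the iconic objects as before ...
}
model.Dispose(); // dispose the model when the loop eventually exits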
Halcon has two memory-management optimization system settings: 'global_mem_cache' and 'temporary_mem_cache'. The 'global_mem_cache' setting had no influence, but setting the 'temporary_mem_cache' parameter to 'idle' or 'shared' solved the problem.
The default setting is 'exclusive', where temporary memory is cached locally for each thread. This is an excerpt from the Halcon documentation:
'temporary_mem_cache' *), 'tsp_temporary_mem_cache'
This parameter controls the operating mode of the temporary memory cache. The temporary memory cache is used to speed up an application by caching memory used temporarily during the execution of an operator. For most applications the default setting ('exclusive') will produce the best results. The following modes are supported:
'idle' The temporary memory cache is turned off. This mode will use the least memory, but will also reduce performance compared to the other modes.
'shared' All temporary memory is cached globally in the temporary memory reservoir. This mode will use less memory than 'exclusive' mode, but will also generally offer less performance.
'exclusive' All temporary memory is cached locally for each thread. This mode will use the most memory, but will generally also offer the best performance.
'aggregate' Temporary memory blocks that are larger than the threshold set with the 'alloctmp_max_blocksize' parameter are cached in the global memory reservoir, while all smaller blocks are aggregated into a single block that is cached locally for each thread. If the global memory reservoir is disabled, the large blocks are freed instead. The aggregated block will be sized according to the temporary memory usage the thread has seen so far, but it will not be larger than 'alloctmp_max_blocksize' (if set) or smaller than 'alloctmp_min_blocksize' (if set). This mode balances memory usage and speed, but requires correctly setting 'alloctmp_min_blocksize' and 'alloctmp_max_blocksize' for the application's memory usage pattern for effectiveness.
Note that cache mode 'idle' is set in exclusive run mode, whereas the other modes are set in reentrant mode.
For backward compatibility, the values 'false' and 'true' are also accepted; they correspond to 'idle' and 'exclusive', respectively.
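In HALCON/.NET this can be set once at application start-up, for example via HOperatorSet.SetSystem, the wrapper for the set_system operator (a sketch; where exactly you call it depends on your application):

// Set before the first operator call.
// "idle" uses the least memory; "shared" is a compromise between memory usage and speed.
HOperatorSet.SetSystem("temporary_mem_cache", "idle");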
I have a Unity app that should have been made in WinForms, but I used Unity. The app loads external images, if they exist. The problem is that Unity keeps using more and more RAM until it crashes. I know it's the images, because I added a feature to turn off image loading and that fixed the problem. I need help getting my app to stop using so much RAM.
Here is the code I used (it's not all mine):
StartCoroutine("FindImage");

public IEnumerator FindImage()
{
    var a = FolderLocationPath;
    /*
    while (!Main.EnableSelected)
    {
        yield return new WaitForSeconds(0.1f);
    }*/
    try
    {
        ModViewImage.sprite = LoadNewSprite(a + @"\preview.jpg");
    }
    catch
    {
        try
        {
            ModViewImage.sprite = LoadNewSprite(a + @"\preview.png");
        }
        catch
        {
            Debug.LogWarning("--No-Image--" + a.Replace(Main.GetMain.FolderPath, ""));
        }
    }
    yield return null;
}

public Sprite LoadNewSprite(string FilePath, float PixelsPerUnit = 0.01f)
{
    // Load a PNG or JPG image from disk to a Texture2D, assign this texture to a new sprite and return its reference
    Texture2D SpriteTexture = LoadTexture(FilePath);
    Sprite NewSprite = Sprite.Create(SpriteTexture, new Rect(0, 0, SpriteTexture.width, SpriteTexture.height), new Vector2(0, 0), PixelsPerUnit);
    return NewSprite;
}

public Texture2D LoadTexture(string FilePath)
{
    // Load a PNG or JPG file from disk to a Texture2D
    // Returns null if load fails
    Texture2D Tex2D;
    byte[] FileData;
    if (System.IO.File.Exists(FilePath))
    {
        FileData = System.IO.File.ReadAllBytes(FilePath);
        Tex2D = new Texture2D(1, 1); // Create new "empty" texture
        if (Tex2D.LoadImage(FileData))
        {
            Tex2D.anisoLevel = 1;
            Tex2D.filterMode = FilterMode.Point;
            Tex2D.Apply();
            return Tex2D;
        } // If data = readable -> return texture
    }
    return null; // Return null if load failed
}
In my personal opinion, the first thing you should always do when you encounter a memory/performance problem is use the profiler.
There you will be able to see what is actually eating up all your RAM, and with the CPU profiler you can see garbage collection allocation to see if there are any suspects there.
I would suggest in the future to use the profiler as soon as you encounter a similar problem.
That being said, from the code you did include, I don't think it's possible to tell for certain where your problem is.
Some general suggestions:
If you do not need read/write access to the texture data/pixels, pass false as the 2nd parameter (markNonReadable) of Texture2D.LoadImage in order to mark the texture as non-readable; this roughly halves the RAM consumption per texture loaded into memory, and readability is only needed if you manipulate/read data from the texture.
From the usage of the legacy LoadTexture pattern, I think you are running on Unity 5.3 or so; I would strongly recommend updating your Unity to a more recent version - one from the current decade - which will most likely improve your performance overall and address various memory leaks in the engine that might be causing this.
Everything in managed code needs to be dereferenced when you are done with it; if you are loading these images and keeping a reference somewhere, they will not be freed from memory, and this might also be a reason for your problem.
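One concrete thing to check, as a sketch (assuming ModViewImage is a UnityEngine.UI.Image and nothing else holds the sprite): textures and sprites created at runtime are native Unity objects, so dropping the managed reference alone does not free them - destroy the previous sprite and its texture before assigning a new one.

// Hypothetical helper: release the previously loaded sprite before replacing it.
// Runtime-created Texture2D/Sprite objects must be destroyed explicitly;
// clearing the C# reference is not enough to free the native texture memory.
void ReplaceSprite(UnityEngine.UI.Image target, Sprite newSprite)
{
    var old = target.sprite;
    target.sprite = newSprite;
    if (old != null)
    {
        UnityEngine.Object.Destroy(old.texture); // free the native texture
        UnityEngine.Object.Destroy(old);         // free the sprite itself
    }
}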
I store all of my profiles in a profileCache, which eats up a ton of memory within the Large Object Heap. Therefore, I have implemented a method to help delete unused cache entries. The problem is that the method doesn't seem to be clearing the cache correctly and is throwing a stack overflow error. Here are the two methods I have implemented.
private static void OnScavengeProfileCache(object data)
{
    // loop until the runtime is shutting down
    while (HostingEnvironment.ShutdownReason == ApplicationShutdownReason.None)
    {
        // NOTE: Right now we only do the scavenge when traffic is temporarily low,
        // to try to amortize the overhead of scavenging the cache into a low utilization period.
        // We also only scavenge if the process memory usage is very high.
        if (s_timerNoRequests.ElapsedMilliseconds >= 10000)
        {
            // We don't want to scavenge under lock to avoid slowing down requests,
            // so we get the list of keys under lock and then incrementally scan them
            IEnumerable<string> profileKeys = null;
            lock (s_profileCache)
            {
                profileKeys = s_profileCache.Keys.ToList();
            }
            ScavengeProfileCacheIncremental(profileKeys.GetEnumerator());
        }
        // wait for a bit
        Thread.Sleep(60 * 1000);
    }
}
My method constantly scans traffic, and when traffic is low it collects all of my profile keys and stores them in an IEnumerable called profileKeys. I then invoke this method to delete unused keys:
private static void ScavengeProfileCacheIncremental(IEnumerator<string> profileKeys)
{
    if (s_thisProcess.PrivateMemorySize64 >= (200 * 1024 * 1024)) // 3Gb at least
    {
        int numProcessed = 0;
        while (profileKeys.MoveNext())
        {
            var key = profileKeys.Current;
            Profile profile = null;
            if (s_profileCache.TryGetValue(key, out profile))
            {
                // safely check/remove under lock; it's fast but makes sure we don't blow away someone currently being added
                lock (s_profileCache)
                {
                    if (DateTime.UtcNow.Subtract(profile.CreateTime).TotalMinutes > 5)
                    {
                        // can clear it out
                        s_profileCache.Remove(key);
                    }
                }
            }
            if (++numProcessed >= 5)
            {
                // stop this scan and check memory again
                break;
            }
        }
        // Check again to see if we freed up memory, if not continue scanning the profiles?
        ScavengeProfileCacheIncremental(profileKeys);
    }
}
The method is not clearing up memory and is throwing a stack overflow error with this trace:
192. ProfileHelper.ScavengeProfileCacheIncremental(
193. ProfileHelper.ScavengeProfileCacheIncremental(
194. ProfileHelper.ScavengeProfileCacheIncremental(
195. ProfileHelper.ScavengeProfileCacheIncremental(
196. ProfileHelper.OnScavengeProfileCache(...)
197. ExecutionContext.RunInternal(...)
198. ExecutionContext.Run(...)
199. IThreadPoolWorkItem.ExecuteWorkItem(...)
200. ThreadPoolWorkQueue.Dispatch(...)
EDIT:
So would this be a possible solution to remove unused profile keys and clear LOH...
private static void ScavengeProfileCacheIncremental(IEnumerator<string> profileKeys)
{
    if (s_thisProcess.PrivateMemorySize64 >= (200 * 1024 * 1024)) // 3Gb at least
    {
        int numProcessed = 0;
        while (profileKeys.MoveNext())
        {
            var key = profileKeys.Current;
            Profile profile = null;
            if (s_profileCache.TryGetValue(key, out profile))
            {
                // safely check/remove under lock; it's fast but makes sure we don't blow away someone currently being added
                lock (s_profileCache)
                {
                    if (DateTime.UtcNow.Subtract(profile.CreateTime).TotalMinutes > 5)
                    {
                        // can clear it out
                        s_profileCache.Remove(key);
                    }
                }
            }
            if (++numProcessed >= 5)
            {
                // stop this scan and check memory again
                break;
            }
        }
    }
    GC.Collect();
}
I believe your code is suffering from a problem known as infinite recursion.
You are calling the method ScavengeProfileCacheIncremental, which in turn calls itself. At some point, you have called into it enough times that you run out of stack space, causing an overflow.
Either your condition is not being met before you run out of stack, or your condition is never met at all. Debugging should show you why.
You can read more on the subject here.
There is no exit from ScavengeProfileCacheIncremental.
It does its stuff and then calls itself. It then does its stuff and calls itself. It then does its stuff and calls itself. It then does its stuff and calls itself. It then does its stuff and calls itself.
After a while it uses all the stack space and the process crashes.
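A sketch of one way to remove the recursion while keeping the "process a few keys, then re-check memory" behaviour (the threshold and batch size are taken from the question; a loop simply replaces the tail call):

private static void ScavengeProfileCacheIncremental(IEnumerator<string> profileKeys)
{
    // Keep scanning in batches of 5 while memory is above the threshold,
    // instead of having the method call itself recursively.
    while (s_thisProcess.PrivateMemorySize64 >= (200 * 1024 * 1024))
    {
        int numProcessed = 0;
        bool moreKeys = false;
        while (profileKeys.MoveNext())
        {
            moreKeys = true;
            var key = profileKeys.Current;
            Profile profile;
            if (s_profileCache.TryGetValue(key, out profile))
            {
                lock (s_profileCache)
                {
                    if (DateTime.UtcNow.Subtract(profile.CreateTime).TotalMinutes > 5)
                    {
                        s_profileCache.Remove(key);
                    }
                }
            }
            if (++numProcessed >= 5)
            {
                break; // stop this batch and re-check memory
            }
        }
        if (!moreKeys)
        {
            break; // enumerator exhausted - nothing left to scavenge
        }
    }
}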
In the worst case, does this sample allocate testCnt * xArray.Length bytes of storage in GPU global memory? How can I make sure that just one copy of the array is transferred to the device? The GpuManaged attribute seems to serve this purpose, but it doesn't solve our unexpected memory consumption.
void Worker(int ix, byte[] array)
{
    // process array - only read access
}

void Run()
{
    var xArray = new byte[100];
    var testCnt = 10;
    Gpu.Default.For(0, testCnt, ix => Worker(ix, xArray));
}
EDIT
The main question in a more precise form:
Does each worker thread get a fresh copy of xArray or is there only one copy of xArray for all threads?
Your sample code should allocate 100 bytes of memory on the GPU and 100 bytes of memory on the CPU.
(.Net adds a bit of overhead, but we can ignore that)
Since you're using implicit memory, some resources need to be allocated to track that memory, (basically where it lives: CPU/GPU).
Now... you're probably seeing higher memory consumption on the CPU side, I assume.
The reason for that is possibly kernel compilation happening on the fly.
AleaGPU has to compile your IL code into LLVM; that LLVM IR is fed into the CUDA compiler, which in turn converts it into PTX.
This happens when you run a kernel for the first time.
All of the resources and unmanaged dlls are loaded into memory.
That's possibly what you're seeing.
testCnt has no effect on the amount of memory being allocated.
EDIT
One suggestion is to use memory in an explicit way.
It's faster and more efficient:
private static void Run()
{
    var input = Gpu.Default.AllocateDevice<byte>(100);
    var deviceptr = input.Ptr;
    Gpu.Default.For(0, input.Length, i => Worker(i, deviceptr));
    Console.WriteLine(string.Join(", ", Gpu.CopyToHost(input)));
}

private static void Worker(int ix, deviceptr<byte> array)
{
    array[ix] = 10;
}
Try using explicit memory:
static void Worker(int ix, byte[] array)
{
    // you must write something back; note, I changed your Worker
    // function to static!
    array[ix] += 1;
}

void Run()
{
    var gpu = Gpu.Default;
    var hostArray = new byte[100];
    // set your host array
    var deviceArray = gpu.Allocate<byte>(100);
    // deviceArray is of type byte[], but deviceArray.Length == 0,
    // while Gpu.ArrayGetLength(deviceArray) == 100
    Gpu.Copy(hostArray, deviceArray);
    var testCnt = 10;
    gpu.For(0, testCnt, ix => Worker(ix, deviceArray));
    // you must copy memory back
    Gpu.Copy(deviceArray, hostArray);
    // check your result in hostArray
    Gpu.Free(deviceArray);
}
When invoking UpdatePerformanceCounters: in this updater, all the counter names for the category and instance counters are the same - they are always derived from an Enum. The updater is passed a "profile", typically with content such as:
{saTrilogy.Core.Instrumentation.PerformanceCounterProfile}
_disposed: false
CategoryDescription: "Timed function for a Data Access process"
CategoryName: "saTrilogy<Core> DataAccess Span"
Duration: 405414
EndTicks: 212442328815
InstanceName: "saTrilogy.Core.DataAccess.ParameterCatalogue..ctor::[dbo].[sp_KernelProcedures]"
LogFormattedEntry: "{\"CategoryName\":\"saTrilogy<Core> DataAccess ...
StartTicks: 212441923401
Note the "complexity" of the Instance name.
The toUpdate.AddRange() of the VerifyCounterExistence method always succeeds and produces the "expected" output so the UpdatePerformanceCounters method continues through to the "successful" incrementing of the counters.
Despite the "catch" this never "fails" - except, when viewing the Category in PerfMon, it shows no instances or, therefore, any "successful" update of an instance counter.
I suspect my problem may be that my instance name is being rejected, without exception, because of its "complexity" - when I run this through a console tester via PerfView it does not show any exception stack and the ETW events associated with counter updates are successfully recorded in an out-of-process sink. Also, there are no entries in the Windows Logs.
This is all being run "locally" via VS2012 on a Windows 2008R2 server with NET 4.5.
Does anyone have any ideas of how else I may try this - or even test if the "update" is being accepted by PerfMon?
public sealed class Performance {
    private enum ProcessCounterNames {
        [Description("Total Process Invocation Count")]
        TotalProcessInvocationCount,
        [Description("Average Process Invocation Rate per second")]
        AverageProcessInvocationRate,
        [Description("Average Duration per Process Invocation")]
        AverageProcessInvocationDuration,
        [Description("Average Time per Process Invocation - Base")]
        AverageProcessTimeBase
    }

    private readonly static CounterCreationDataCollection ProcessCounterCollection = new CounterCreationDataCollection{
        new CounterCreationData(
            Enum<ProcessCounterNames>.GetName(ProcessCounterNames.TotalProcessInvocationCount),
            Enum<ProcessCounterNames>.GetDescription(ProcessCounterNames.TotalProcessInvocationCount),
            PerformanceCounterType.NumberOfItems32),
        new CounterCreationData(
            Enum<ProcessCounterNames>.GetName(ProcessCounterNames.AverageProcessInvocationRate),
            Enum<ProcessCounterNames>.GetDescription(ProcessCounterNames.AverageProcessInvocationRate),
            PerformanceCounterType.RateOfCountsPerSecond32),
        new CounterCreationData(
            Enum<ProcessCounterNames>.GetName(ProcessCounterNames.AverageProcessInvocationDuration),
            Enum<ProcessCounterNames>.GetDescription(ProcessCounterNames.AverageProcessInvocationDuration),
            PerformanceCounterType.AverageTimer32),
        new CounterCreationData(
            Enum<ProcessCounterNames>.GetName(ProcessCounterNames.AverageProcessTimeBase),
            Enum<ProcessCounterNames>.GetDescription(ProcessCounterNames.AverageProcessTimeBase),
            PerformanceCounterType.AverageBase),
    };

    private static bool VerifyCounterExistence(PerformanceCounterProfile profile, out List<PerformanceCounter> toUpdate) {
        toUpdate = new List<PerformanceCounter>();
        bool willUpdate = true;
        try {
            if (!PerformanceCounterCategory.Exists(profile.CategoryName)) {
                PerformanceCounterCategory.Create(profile.CategoryName, profile.CategoryDescription, PerformanceCounterCategoryType.MultiInstance, ProcessCounterCollection);
            }
            toUpdate.AddRange(Enum<ProcessCounterNames>.GetNames().Select(counterName => new PerformanceCounter(profile.CategoryName, counterName, profile.InstanceName, false) { MachineName = "." }));
        }
        catch (Exception error) {
            Kernel.Log.Trace(Reflector.ResolveCaller<Performance>(), EventSourceMethods.Kernel_Error, new PacketUpdater {
                Message = StandardMessage.PerformanceCounterError,
                Data = new Dictionary<string, object> { { "Instance", profile.LogFormattedEntry } },
                Error = error
            });
            willUpdate = false;
        }
        return willUpdate;
    }

    public static void UpdatePerformanceCounters(PerformanceCounterProfile profile) {
        List<PerformanceCounter> toUpdate;
        if (profile.Duration <= 0 || !VerifyCounterExistence(profile, out toUpdate)) {
            return;
        }
        foreach (PerformanceCounter counter in toUpdate) {
            if (Equals(PerformanceCounterType.RateOfCountsPerSecond32, counter.CounterType)) {
                counter.IncrementBy(profile.Duration);
            }
            else {
                counter.Increment();
            }
        }
    }
}
From the MSDN .NET 4.5 documentation for the PerformanceCounter.InstanceName property (http://msdn.microsoft.com/en-us/library/system.diagnostics.performancecounter.instancename.aspx):
Note: Instance names must be shorter than 128 characters in length.
Note: Do not use the characters "(", ")", "#", "\", or "/" in the instance name. If any of these characters are used, the Performance Console (see Runtime Profiling) may not correctly display the instance values.
The 79-character instance name I use above satisfies these conditions so, unless ".", ":", "[" and "]" are also "reserved", the name would not appear to be the issue. I also tried a 64-character substring of the instance name - just in case - as well as a plain "test" string, all to no avail.
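If in doubt, a small hypothetical helper like the one below can rule the name out entirely by stripping the documented problem characters and enforcing the length limit (the character set comes from the notes above; the 127-character cap is discussed in the conclusion further down):

// Hypothetical helper: sanitise an instance name per the documented rules
// (no '(', ')', '#', '\' or '/', and shorter than 128 characters).
private static string SanitiseInstanceName(string name) {
    var builder = new System.Text.StringBuilder(name.Length);
    foreach (char c in name) {
        if ("()#\\/".IndexOf(c) < 0) {
            builder.Append(c);
        }
    }
    string sanitised = builder.ToString();
    return sanitised.Length > 127 ? sanitised.Substring(0, 127) : sanitised;
}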
Changes...
Apart from the Enum and the ProcessCounterCollection I have replaced the class body with the following:
private static readonly Dictionary<string, List<PerformanceCounter>> definedInstanceCounters = new Dictionary<string, List<PerformanceCounter>>();

private static void UpdateDefinedInstanceCounterDictionary(string dictionaryKey, string categoryName, string instanceName = null) {
    definedInstanceCounters.Add(
        dictionaryKey,
        !PerformanceCounterCategory.InstanceExists(instanceName ?? "Total", categoryName)
            ? Enum<ProcessCounterNames>.GetNames().Select(counterName => new PerformanceCounter(categoryName, counterName, instanceName ?? "Total", false) { RawValue = 0, MachineName = "." }).ToList()
            : PerformanceCounterCategory.GetCategories().First(category => category.CategoryName == categoryName).GetCounters().Where(counter => counter.InstanceName == (instanceName ?? "Total")).ToList());
}

public static void InitialisationCategoryVerify(IReadOnlyCollection<PerformanceCounterProfile> etwProfiles) {
    foreach (PerformanceCounterProfile profile in etwProfiles) {
        if (!PerformanceCounterCategory.Exists(profile.CategoryName)) {
            PerformanceCounterCategory.Create(profile.CategoryName, profile.CategoryDescription, PerformanceCounterCategoryType.MultiInstance, ProcessCounterCollection);
        }
        UpdateDefinedInstanceCounterDictionary(profile.DictionaryKey, profile.CategoryName);
    }
}

public static void UpdatePerformanceCounters(PerformanceCounterProfile profile) {
    if (!definedInstanceCounters.ContainsKey(profile.DictionaryKey)) {
        UpdateDefinedInstanceCounterDictionary(profile.DictionaryKey, profile.CategoryName, profile.InstanceName);
    }
    definedInstanceCounters[profile.DictionaryKey].ForEach(c => c.IncrementBy(c.CounterType == PerformanceCounterType.AverageTimer32 ? profile.Duration : 1));
    definedInstanceCounters[profile.TotalInstanceKey].ForEach(c => c.IncrementBy(c.CounterType == PerformanceCounterType.AverageTimer32 ? profile.Duration : 1));
}
}
In the PerformanceCounterProfile I've added:
internal string DictionaryKey {
    get {
        return String.Concat(CategoryName, " - ", InstanceName ?? "Total");
    }
}

internal string TotalInstanceKey {
    get {
        return String.Concat(CategoryName, " - Total");
    }
}
The ETW EventSource now does the initialisation for the "pre-defined" performance categories whilst also creating an instance called "Total".
PerformanceCategoryProfile = Enum<EventSourceMethods>.GetValues().ToDictionary(esm => esm, esm => new PerformanceCounterProfile(String.Concat("saTrilogy<Core> ", Enum<EventSourceMethods>.GetName(esm).Replace("_", " ")), Enum<EventSourceMethods>.GetDescription(esm)));
Performance.InitialisationCategoryVerify(PerformanceCategoryProfile.Values.Where(v => !v.CategoryName.EndsWith("Trace")).ToArray());
This creates all of the categories, as expected, but in PerfMon I still cannot see any instances - not even the "Total" instance - and the update always, apparently, runs without error.
I don't know what else I can "change" - I am probably "too close" to the problem - and would appreciate comments/corrections.
These are the conclusions and the "answer", insofar as it explains, to the best of my ability, what I believe is happening. Posted by myself in the hope that - given my recent helpful use of Stack Overflow - it will be of use to others...
Firstly, there is essentially nothing wrong with the code displayed excepting one proviso - mentioned later. Putting a Console.ReadKey() before program termination and after having done a PerformanceCounterCategory(categoryKey).ReadCategory() it is quite clear that not only are the registry entries correct (for this is where ReadCategory sources its results) but that the instance counters have all been incremented by the appropriate values. If one looks at PerfMon before the program terminates the instance counters are there and they do contain the appropriate Raw Values.
This is the crux of my "problem" - or, rather, my incomplete understanding of the architecture: INSTANCE COUNTERS ARE TRANSIENT - INSTANCES ARE NOT PERSISTED BEYOND THE TERMINATION OF A PROGRAM/PROCESS. This, once it dawned on me, is "obvious" - for example, try using PerfMon to look at an instance counter of one of your IIS AppPools - then stop the AppPool and you will see, in PerfMon, that the Instance for the stopped AppPool is no longer visible.
Given this axiom about instance counters, the code above has another completely irrelevant section: in the method UpdateDefinedInstanceCounterDictionary, assigning the list from an existing counter set is pointless. Firstly, the "else" branch shown will fail, since we are attempting to return a collection of (instance) counters for which this approach will not work; secondly, GetCategories() followed by GetCounters() and/or GetInstanceNames() is an extraordinarily expensive and time-consuming process - even if it were to work. The appropriate method to use is the one mentioned earlier - PerformanceCounterCategory(categoryKey).ReadCategory(). However, this returns an InstanceDataCollectionCollection, which is effectively read-only, so as a provider (as opposed to a consumer) of counters it is pointless. In fact, you can simply use the Enum-generated list of new PerformanceCounters - it works regardless of whether the counters already exist or not.
Anyway, the InstanceDataCollectionCollection (this is essentially what is demonstrated by the Win32 SDK for .NET 3.5 "Usermode Counter Sample") uses a "Sample" counter which is populated and returned - as per the usage of the System.Diagnostics.PerformanceData namespace, which looks like part of the version 2.0 usage - and that usage is "incompatible" with the System.Diagnostics.PerformanceCounterCategory usage shown.
Admittedly, the fact of non-persistence may seem obvious and may well be stated in documentation but, if I were to read all the documentation about everything I need to use beforehand, I'd probably end up not actually writing any code! Furthermore, even if such pertinent documentation were easy to find (as opposed to experiences posted on, for example, Stack Overflow), I'm not sure I trust all of it. For example, I noted above that the instance name in the MSDN documentation has a 128 character limit - wrong; it is actually 127, since the underlying string must be null-terminated. Also, for example, for ETW, I wish it were made more obvious that keyword values must be powers of 2 and that opcodes with a value of less than 12 are used by the system - at least PerfView was able to show me this.
Ultimately this question has no "answer" other than a better understanding of instance counters - especially their persistence. Since my code is intended for use in a Windows Service based Web API, its persistence is not an issue (especially with daily use of LogMan etc.) - the confusing thing is that the damn things didn't appear until I paused the code and checked PerfMon, and I could have saved myself a lot of time and hassle if I had known this beforehand. In any event, my ETW event source logs all elapsed execution times and instances of what the performance counters "monitor" anyway.
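For anyone wanting to reproduce the observation above, a minimal verification sketch using the standard System.Diagnostics types (DumpCategory is just an illustrative name; pause the providing process, e.g. with Console.ReadKey(), before calling it):

// Dump the raw values of every instance counter in a category while the
// providing process is still alive.
private static void DumpCategory(string categoryName) {
    InstanceDataCollectionCollection data = new PerformanceCounterCategory(categoryName).ReadCategory();
    foreach (InstanceDataCollection counter in data.Values) {
        foreach (InstanceData instance in counter.Values) {
            Console.WriteLine("{0} [{1}] = {2}", counter.CounterName, instance.InstanceName, instance.RawValue);
        }
    }
}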
I was seeing some strange behavior in a multi-threaded application which I wrote, and which was not scaling well across multiple cores.
The following code illustrates the behavior I am seeing. It appears that heap-intensive operations do not scale across multiple cores; rather, they seem to slow down, i.e. using a single thread would be faster.
class Program
{
    public static Data _threadOneData = new Data();
    public static Data _threadTwoData = new Data();
    public static Data _threadThreeData = new Data();
    public static Data _threadFourData = new Data();

    static void Main(string[] args)
    {
        // Do heap intensive tests
        var start = DateTime.Now;
        RunOneThread(WorkerUsingHeap);
        var finish = DateTime.Now;
        var timeLapse = finish - start;
        Console.WriteLine("One thread using heap: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingHeap);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using heap: " + timeLapse);

        // Do stack intensive tests
        start = DateTime.Now;
        RunOneThread(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("One thread using stack: " + timeLapse);

        start = DateTime.Now;
        RunFourThreads(WorkerUsingStack);
        finish = DateTime.Now;
        timeLapse = finish - start;
        Console.WriteLine("Four threads using stack: " + timeLapse);

        Console.ReadLine();
    }

    public static void RunOneThread(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker)
    {
        var threadOne = new Thread(worker);
        threadOne.Start(_threadOneData);
        var threadTwo = new Thread(worker);
        threadTwo.Start(_threadTwoData);
        var threadThree = new Thread(worker);
        threadThree.Start(_threadThreeData);
        var threadFour = new Thread(worker);
        threadFour.Start(_threadFourData);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 100000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    static void WorkerUsingStack(object state)
    {
        var data = state as Data;
        double dataOnStack = data.Property;
        for (int count = 0; count < 100000000; count++)
        {
            dataOnStack++;
        }
        data.Property = dataOnStack;
    }

    public class Data
    {
        public double Property
        {
            get;
            set;
        }
    }
}
This code was run on a Core 2 Quad (4 core system) with the following results:
One thread using heap: 00:00:01.8125000
Four threads using heap: 00:00:17.7500000
One thread using stack: 00:00:00.3437500
Four threads using stack: 00:00:00.3750000
So using the heap with four threads did 4 times the work but took almost 10 times as long. This means it would be twice as fast in this case to use only one thread??????
Using the stack was much more as expected.
I would like to know what is going on here. Can the heap only be written to from one thread at a time?
The answer is simple - run outside of Visual Studio...
I just copied your entire program, and ran it on my quad core system.
Inside VS (Release Build):
One thread using heap: 00:00:03.2206779
Four threads using heap: 00:00:23.1476850
One thread using stack: 00:00:00.3779622
Four threads using stack: 00:00:00.5219478
Outside VS (Release Build):
One thread using heap: 00:00:00.3899610
Four threads using heap: 00:00:00.4689531
One thread using stack: 00:00:00.1359864
Four threads using stack: 00:00:00.1409859
Note the difference. The extra time in the build outside VS is pretty much all due to the overhead of starting the threads. Your work in this case is too small to really test, and you're not using the high performance counters, so it's not a perfect test.
Main rule of thumb - always do perf testing outside VS, i.e. use Ctrl+F5 instead of F5 to run.
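On the "high performance counters" point, a sketch of how you might time these runs instead (Measure is just an illustrative helper): System.Diagnostics.Stopwatch uses the high-resolution timer and is a better fit than subtracting DateTime.Now values.

// Requires: using System.Diagnostics;
// Time a test run with Stopwatch instead of DateTime arithmetic.
static TimeSpan Measure(Action test)
{
    var stopwatch = Stopwatch.StartNew();
    test();
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

// Usage, for example:
// Console.WriteLine("Four threads using heap: " + Measure(() => RunFourThreads(WorkerUsingHeap)));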
Aside from the debug-vs-release effects, there is something more you should be aware of.
You cannot effectively evaluate multi-threaded code for performance in 0.3s.
The point of threads is two-fold: effectively model parallel work in code, and effectively exploit parallel resources (cpus, cores).
You are trying to evaluate the latter. Given that thread start overhead is not vanishingly small in comparison to the interval over which you are timing, your measurement is immediately suspect. In most perf test trials, a significant warm-up interval is appropriate. This may sound silly to you - it's a computer program after all, not a lawnmower. But warm-up is absolutely imperative if you are really going to evaluate multi-thread performance. Caches get filled, pipelines fill up, pools get filled, GC generations get filled. The steady-state, continuous performance is what you would like to evaluate. For purposes of this exercise, the program behaves like a lawnmower.
You could say - Well, no, I don't want to evaluate the steady state performance. And if that is the case, then I would say that your scenario is very specialized. Most app scenarios, whether their designers explicitly realize it or not, need continuous, steady performance.
If you truly need the perf to be good only over a single 0.3s interval, you have found your answer. But be careful to not generalize the results.
If you want general results, you need to have reasonably long warm up intervals, and longer collection intervals. You might start at 20s/60s for those phases, but here is the key thing: you need to vary those intervals until you find the results converging. YMMV. The valid times vary depending on the application workload and the resources dedicated to it, obviously. You may find that a measurement interval of 120s is necessary for convergence, or you may find 40s is just fine. But (a) you won't know until you measure it, and (b) you can bet 0.3s is not long enough.
[edit]Turns out, this is a release vs. debug build issue -- not sure why it is, but it is. See comments and other answers.[/edit]
This was very interesting -- I wouldn't have guessed there'd be that much difference. (similar test machine here -- Core 2 Quad Q9300)
Here's an interesting comparison -- add a decent-sized additional element to the 'Data' class -- I changed it to this:
public class Data
{
    public double Property { get; set; }
    public byte[] Spacer = new byte[8096];
}
It's still not quite the same time, but it's very close (running it for 10x as long results in 13.1s vs. 17.6s on my machine).
If I had to guess, I'd speculate that it's related to cross-core cache coherency, at least if I'm remembering how CPU cache works. With the small version of 'Data', if a single cache line contains multiple instances of Data, the cores are having to constantly invalidate each other's caches (worst case if they're all on the same cache line). With the 'spacer' added, their memory addresses are sufficiently far enough apart that one CPU's write of a given address doesn't invalidate the caches of the other CPUs.
Another thing to note -- the 4 threads start nearly concurrently, but they don't finish at the same time -- another indication that there's cross-core issues at work here. Also, I'd guess that running on a multi-cpu machine of a different architecture would bring more interesting issues to light here.
I guess the lesson from this is that in a highly-concurrent scenario, if you're doing a bunch of work with a few small data structures, you should try to make sure they aren't all packed on top of each other in memory. Of course, there's really no way to make sure of that, but I'm guessing there are techniques (like adding spacers) that could be used to try to make it happen.
[edit]
This was too interesting -- I couldn't put it down. To test this out further, I thought I'd try varying-sized spacers, and use an integer instead of a double to keep the object without any added spacers smaller.
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("name\t1 thread\t4 threads");
        RunTest("no spacer", WorkerUsingHeap, () => new Data());
        var values = new int[] { -1, 0, 4, 8, 12, 16, 20 };
        foreach (var sv in values)
        {
            var v = sv;
            RunTest(string.Format(v == -1 ? "null spacer" : "{0}B spacer", v), WorkerUsingHeap, () => new DataWithSpacer(v));
        }
        Console.ReadLine();
    }

    public static void RunTest(string name, ParameterizedThreadStart worker, Func<object> fo)
    {
        var start = DateTime.UtcNow;
        RunOneThread(worker, fo);
        var middle = DateTime.UtcNow;
        RunFourThreads(worker, fo);
        var end = DateTime.UtcNow;
        Console.WriteLine("{0}\t{1}\t{2}", name, middle - start, end - middle);
    }

    public static void RunOneThread(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data);
        threadOne.Join();
    }

    public static void RunFourThreads(ParameterizedThreadStart worker, Func<object> fo)
    {
        var data1 = fo();
        var data2 = fo();
        var data3 = fo();
        var data4 = fo();
        var threadOne = new Thread(worker);
        threadOne.Start(data1);
        var threadTwo = new Thread(worker);
        threadTwo.Start(data2);
        var threadThree = new Thread(worker);
        threadThree.Start(data3);
        var threadFour = new Thread(worker);
        threadFour.Start(data4);
        threadOne.Join();
        threadTwo.Join();
        threadThree.Join();
        threadFour.Join();
    }

    static void WorkerUsingHeap(object state)
    {
        var data = state as Data;
        for (int count = 0; count < 500000000; count++)
        {
            var property = data.Property;
            data.Property = property + 1;
        }
    }

    public class Data
    {
        public int Property { get; set; }
    }

    public class DataWithSpacer : Data
    {
        // negative size => no spacer array at all (the "null spacer" case)
        public DataWithSpacer(int size) { Spacer = size < 0 ? null : new byte[size]; }
        public byte[] Spacer;
    }
}
Result:
name            1 thread            4 threads
no spacer       00:00:06.3480000    00:00:42.6260000
null spacer     00:00:06.2300000    00:00:36.4030000
0B spacer       00:00:06.1920000    00:00:19.8460000
4B spacer       00:00:06.1870000    00:00:07.4150000
8B spacer       00:00:06.3750000    00:00:07.1260000
12B spacer      00:00:06.3420000    00:00:07.6930000
16B spacer      00:00:06.2250000    00:00:07.5530000
20B spacer      00:00:06.2170000    00:00:07.3670000
No spacer = 1/6th the speed, null spacer = 1/5th the speed, 0B spacer = 1/3rd the speed, 4B spacer = full speed.
I don't know the full details of how the CLR allocates or aligns objects, so I can't speak to what these allocation patterns look like in real memory, but these definitely are some interesting results.