Threads increase abnormally in linux service - c#

I have a service that runs on Linux under systemd but is compiled and debugged in VS22 on Windows.
The service is mainly a proxy to a MariaDB 10 database, shaped as a BackgroundWorker serving clients via SignalR.
If I run it in release mode on Windows, the number of logical threads stays at a reasonable value (approx. 20-25). See pic below.
Under Linux, after a few minutes (I cannot give you more insight unfortunately... I still have to figure out what could be changing), the number of threads starts increasing constantly, every second.
See pic here, already at more than 100 and still counting:
From reading "current logical threads increasing / thread stack is leaking" I got confirmation that the CLR allows new threads when the others are not completing, but there is currently no change in the code when moving from Windows to Linux.
This is the HostBuilder with the call to SystemD
 public static IHostBuilder CreateWebHostBuilder(string[] args)
        {
            string curDir = MondayConfiguration.DefineCurrentDir();
            IConfigurationRoot config = new ConfigurationBuilder()
                // .SetBasePath(Directory.GetCurrentDirectory())
                .SetBasePath(curDir)
                .AddJsonFile("servicelocationoptions.json", optional: false, reloadOnChange: true)
#if DEBUG
                   .AddJsonFile("appSettings.Debug.json")
#else
                   .AddJsonFile("appSettings.json")
#endif
                   .Build();
            return Host.CreateDefaultBuilder(args)
                .UseContentRoot(curDir)
                .ConfigureAppConfiguration((_, configuration) =>
                {
                    configuration
                    .AddIniFile("appSettings.ini", optional: true, reloadOnChange: true)
#if DEBUG
                   .AddJsonFile("appSettings.Debug.json")
#else
                   .AddJsonFile("appSettings.json")
#endif
                    .AddJsonFile("servicelocationoptions.json", optional: false, reloadOnChange: true);
                })
                .UseSerilog((_, services, configuration) => configuration
                    .ReadFrom.Configuration(config, sectionName: "AppLog")// (context.Configuration)
                    .ReadFrom.Services(services)
                    .Enrich.FromLogContext()
                    .WriteTo.Console())
                // .UseSerilog(MondayConfiguration.Logger)
                .ConfigureServices((hostContext, services) =>
                {
                    services
                    .Configure<ServiceLocationOptions>(hostContext.Configuration.GetSection(key: nameof(ServiceLocationOptions)))
                    .Configure<HostOptions>(opts => opts.ShutdownTimeout = TimeSpan.FromSeconds(30));
                })
                .ConfigureWebHostDefaults(webBuilder =>
                {
                    webBuilder.UseStartup<Startup>();
                    ServiceLocationOptions locationOptions = config.GetSection(nameof(ServiceLocationOptions)).Get<ServiceLocationOptions>();
                    string url = locationOptions.HttpBase + "*:" + locationOptions.Port;
                    webBuilder.UseUrls(url);
                })
                .UseSystemd();
        }
In the meantime I am trying to trace all the Monitor.Enter() calls that I use to serialize the API endpoints that touch the state of the service and its inner structures, but on Windows everything seems OK.
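As a side note on the Monitor.Enter() serialization mentioned above: if an endpoint blocks a pool thread while waiting for the lock, the thread pool may inject additional threads to keep up, which is one possible (unconfirmed) source of a growing thread count. A minimal sketch of a non-blocking alternative, assuming the endpoints are async; the class name SerializedState and the method RunExclusiveAsync are hypothetical and not part of the service:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: serializes access to shared state without blocking a pool thread.
public sealed class SerializedState
{
    // One permit: only one caller at a time may run inside the gate.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public async Task<T> RunExclusiveAsync<T>(Func<Task<T>> action, CancellationToken ct = default)
    {
        // Waiting asynchronously releases the current thread back to the pool.
        await _gate.WaitAsync(ct);
        try
        {
            return await action();
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

Unlike Monitor, SemaphoreSlim is not reentrant, so a call to RunExclusiveAsync from inside another RunExclusiveAsync on the same instance would wait forever.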
I am starting to wonder whether the issue is in the call to UseSystemd(). I would like to know what is really involved in a call to UseSystemd(), but there is not much documentation around.
I only found https://devblogs.microsoft.com/dotnet/net-core-and-systemd/ by Glenn Condron and a few quick notes on MSDN.
EDIT 1: To debug further I created a class to scan the thread pool using ClrMD.
My main service has a heartbeat (weirdly, it is called Ping) as follows (note the added call to ProcessTracker.Scan()):
private async Task Ping()
{
await _containerServer.SyslogQueue.Writer.WriteAsync((
LogLevel.Information,
$"Monday Service active at: {DateTime.UtcNow.ToLocalTime()}"));
string processMessage = ProcessTracker.Scan();
await _containerServer.SyslogQueue.Writer.WriteAsync((LogLevel.Information, processMessage));
_logger.DebugInfo()
.Information("Monday Service active at: {Now}", DateTime.UtcNow.ToLocalTime());
}
where ProcessTracker is constructed like this:
public static class ProcessTracker
{
static ProcessTracker()
{
}
public static string Scan()
{
// see https://stackoverflow.com/questions/31633541/clrmd-throws-exception-when-creating-runtime/31745689#31745689
StringBuilder sb = new();
string answer = $"Active Threads{Environment.NewLine}";
// Create the data target. This tells us the versions of CLR loaded in the target process.
int countThread = 0;
var pid = Process.GetCurrentProcess().Id;
using (var dataTarget = DataTarget.AttachToProcess(pid, 5000, AttachFlag.Passive))
{
// Note I just take the first version of CLR in the process. You can loop over
// every loaded CLR to handle the SxS case where both desktop CLR and .Net Core
// are loaded in the process.
ClrInfo version = dataTarget.ClrVersions[0];
var runtime = version.CreateRuntime();
// Walk each thread in the process.
foreach (ClrThread thread in runtime.Threads)
{
try
{
sb = new();
// The ClrRuntime.Threads will also report threads which have recently
// died, but their underlying data structures have not yet been cleaned
// up. This can potentially be useful in debugging (!threads displays
// this information with XXX displayed for their OS thread id). You
// cannot walk the stack of these threads though, so we skip them here.
if (!thread.IsAlive)
continue;
sb.Append($"Thread {thread.OSThreadId:X}:");
countThread++;
// Each thread tracks a "last thrown exception". This is the exception
// object which !threads prints. If that exception object is present, we
// will display some basic exception data here. Note that you can get
// the stack trace of the exception with ClrHeapException.StackTrace (we
// don't do that here).
ClrException? currException = thread.CurrentException;
if (currException is ClrException ex)
sb.AppendLine($"Exception: {ex.Address:X} ({ex.Type.Name}), HRESULT={ex.HResult:X}");
// Walk the stack of the thread and print output similar to !ClrStack.
sb.AppendLine(" ------> Managed Call stack:");
var collection = thread.EnumerateStackTrace().ToList();
foreach (ClrStackFrame frame in collection)
{
// Note that CLRStackFrame currently only has three pieces of data:
// stack pointer, instruction pointer, and frame name (which comes
// from ToString). Future versions of this API will allow you to get
// the type/function/module of the method (instead of just the
// name). This is not yet implemented.
sb.AppendLine($" {frame}");
}
}
catch
{
//skip to the next
}
finally
{
answer += sb.ToString();
}
}
}
answer += $"{Environment.NewLine} Total thread listed: {countThread}";
return answer;
}
}
All fine: on Windows it prints a lot of nice information in a kind of textual tree view.
The point is that somewhere it requires Kernel32.dll, and on Linux that is not available. Can someone give hints on this? The service is published natively without .NET infrastructure, in release mode, arch linux64, single file.
thanks a lot
Alex
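
A lighter-weight check that avoids ClrMD (and therefore the Kernel32.dll dependency) is to log the runtime's own counters from the heartbeat. This is only a sketch, assuming .NET Core 3.0 or later, where ThreadPool.ThreadCount and ThreadPool.PendingWorkItemCount are available; the class name ThreadCounters is hypothetical. It does not give stack traces, only numbers that show whether the growth comes from the thread pool:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: summarizes thread counters that work on both Windows and Linux.
public static class ThreadCounters
{
    public static string Snapshot()
    {
        using Process current = Process.GetCurrentProcess();
        ThreadPool.GetAvailableThreads(out int freeWorkers, out int freeIo);
        ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIo);

        // "Busy" is derived as max minus available, as reported by the pool.
        return $"OS threads: {current.Threads.Count}, " +
               $"pool threads: {ThreadPool.ThreadCount}, " +
               $"busy workers: {maxWorkers - freeWorkers}, " +
               $"busy IO: {maxIo - freeIo}, " +
               $"pending work items: {ThreadPool.PendingWorkItemCount}";
    }
}
```

The result of ThreadCounters.Snapshot() could be written to the syslog queue from Ping() in place of ProcessTracker.Scan(). If the pool thread count keeps climbing while the pending work items do not drain, blocked pool threads are a likely suspect.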

I found a way to get what I needed from a simple debug session, skipping the whole logging approach.
I was not aware that I could also attach to a systemd process remotely.
I just followed https://learn.microsoft.com/en-us/visualstudio/debugger/remote-debugging-dotnet-core-linux-with-ssh?view=vs-2022 for a quick step-by-step guide.
The only prerequisites are to build the service in debug mode and to have the .NET runtime installed on the host, but that's really all.
Sorry for not having known this earlier.
Alex

Related

How to sandbox code running another app (started using Process.Start)?

I've heard of sandboxing and how to make a simple example using AppDomain in .NET as in this article https://learn.microsoft.com/en-us/dotnet/framework/misc/how-to-run-partially-trusted-code-in-a-sandbox
However, the unsafe (or untrusted) code I execute here runs as another process started with Process.Start (if you know another way to limit the access of the started process, please suggest it). My purpose is to constrain the resource access of the started process (which may not be a .NET app). So, for example, the started process should not be able to access any file in the current environment.
The issue here is we need a security context (provided by the current AppDomain) having full-trust (unrestricted) for Process.Start to work.
I really hope that the current partially-trusted context (before calling Process.Start) would be cascaded down to the started process and can help constrain the resource access as expected. But if we need a full-trust context to run Process.Start, then it fails right at that step.
I've run out of ideas for how to make this possible because the only way I know to run a process in .NET is using Process.Start but it requires full-trust context … :(
Here is the code I've tried and there is always an error being thrown right before calling Process.Start:
class Program
{
static void Main(string[] args)
{
var ads = new AppDomainSetup();
ads.ApplicationBase = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
var ps = new PermissionSet(System.Security.Permissions.PermissionState.None);
ps.AddPermission(new SecurityPermission(SecurityPermissionFlag.Execution));
var ad = AppDomain.CreateDomain("SB", null, ads, ps);
var sb = ad.CreateInstanceAndUnwrap(typeof(Sandbox).Assembly.FullName, typeof(Sandbox).FullName) as Sandbox;
//the exception is thrown and highlighted at this line
sb.ExecuteUnsafeCode();
Console.WriteLine("End!");
Console.ReadLine();
}
}
//the sandbox class
public class Sandbox : MarshalByRefObject
{
//this simple method stub is just for testing
public void ExecuteUnsafeCode()
{
try
{
var si = new ProcessStartInfo("someSimpleApp.exe");
Process.Start(si);
Console.WriteLine("Run OK!");
} catch(Exception ex)
{
Console.WriteLine("SecurityException caught:\n{0}", ex.ToString());
}
}
}
The exception thrown is a SecurityException with a very short message, "Request failed". The stack trace is also very short (only 3 lines) and contains nothing helpful.
The bigger picture of my purpose here is to run submitted code (from users) in a sandbox so that no malicious code can harm the server. If the submitted code is in some .NET language, it would be easier because I may not have to use Process.Start here. But if the submitted code is Java or unmanaged C++, we really have to compile it into an executable file and run it using Process.Start.
I hope to get some suggestions to try out, of course it's better if I have a right solution for this, thanks!

What does the FabricNotReadableException mean? And how should we respond to it?

We are using the following method in a Stateful Service on Service Fabric. The service has partitions. Sometimes we get a FabricNotReadableException from this piece of code.
public async Task HandleEvent(EventHandlerMessage message)
{
var queue = await StateManager.GetOrAddAsync<IReliableQueue<EventHandlerMessage>>(EventHandlerServiceConstants.EventHandlerQueueName);
using(ITransaction tx = StateManager.CreateTransaction())
{
await queue.EnqueueAsync(tx, message);
await tx.CommitAsync();
}
}
Does that mean that the partition is down and is being moved? Or that we hit a secondary partition? Because there is also a FabricNotPrimaryException that is being raised in some cases.
I have seen the MSDN link (https://msdn.microsoft.com/en-us/library/azure/system.fabric.fabricnotreadableexception.aspx). But what does
Represents an exception that is thrown when a partition cannot accept reads.
mean? What happened that a partition cannot accept a read?
Under the covers Service Fabric has several states that can impact whether a given replica can safely serve reads and writes. They are:
Granted (you can think of this as normal operation)
Not Primary
No Write Quorum (again mainly impacting writes)
Reconfiguration Pending
FabricNotPrimaryException which you mention can be thrown whenever a write is attempted on a replica which is not currently the Primary, and maps to the NotPrimary state.
FabricNotReadableException maps to the other states (you don't really need to worry or differentiate between them), and can happen in a variety of cases. One example is if the replica you are trying to perform the read on is a "Standby" replica (a replica which was down and which has been recovered, but there are already enough active replicas in the replica set). Another example is if the replica is a Primary but is being closed (say due to an upgrade or because it reported fault), or if it is currently undergoing a reconfiguration (say for example that another replica is being added). All of these conditions will result in the replica not being able to satisfy writes for a small amount of time due to certain safety checks and atomic changes that Service Fabric needs to handle under the hood.
You can consider FabricNotReadableException retriable. If you see it, just try the call again and eventually it will resolve into either NotPrimary or Granted. If you get FabricNotPrimary exception, generally this should be thrown back to the client (or the client in some way notified) that it needs to re-resolve in order to find the current Primary (the default communication stacks that Service Fabric ships take care of watching for non-retriable exceptions and re-resolving on your behalf).
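The "just try the call again" advice can be sketched as a small generic retry helper. This is an illustration under assumptions, not Service Fabric's own retry logic: the class TransientRetry, its parameters, and the fixed delay are all hypothetical, and the exception type to retry on is passed in by the caller so the sketch compiles without the Service Fabric SDK:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: retries an async operation while it throws TTransient.
public static class TransientRetry
{
    public static async Task RunAsync<TTransient>(
        Func<Task> operation,
        int maxAttempts,
        TimeSpan delay,
        CancellationToken cancellationToken = default)
        where TTransient : Exception
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return;
            }
            // Only the transient type is swallowed, and only while attempts remain;
            // any other exception (and the final transient failure) propagates to the caller.
            catch (TTransient) when (attempt < maxAttempts)
            {
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

In the service above this might be called as TransientRetry.RunAsync<FabricNotReadableException>(() => HandleEvent(message), 5, TimeSpan.FromMilliseconds(200), token). A FabricNotPrimaryException is deliberately not retried here, so it still reaches the client, which can then re-resolve the Primary.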
There are two current known issues with FabricNotReadableException.
FabricNotReadableException should have two variants. The first should be explicitly retriable (FabricTransientNotReadableException) and the second should be FabricNotReadableException. The first version (Transient) is the most common and is probably what you are running into, certainly what you would run into in the majority of cases. The second (non-transient) would be returned in the case where you end up talking to a Standby replica. Talking to a standby won't happen with the out of the box transports and retry logic, but if you have your own it is possible to run into it.
The other issue is that today the FabricNotReadableException should be deriving from FabricTransientException, making it easier to determine what the correct behavior is.
Posted as an answer (to asnider's comment - Mar 16 at 17:42) because it was too long for comments! :)
I am also stuck in this catch 22. My svc starts and immediately receives messages. I want to encapsulate the service startup in OpenAsync and set up some ReliableDictionary values, then start receiving message. However, at this point the Fabric is not Readable and I need to split this "startup" between OpenAsync and RunAsync :(
RunAsync in my service and OpenAsync in my client also seem to have different Cancellation tokens, so I need to work around how to deal with this too. It just all feels a bit messy. I have a number of ideas on how to tidy this up in my code but has anyone come up with an elegant solution?
It would be nice if ICommunicationClient had a RunAsync interface that was called when the Fabric becomes ready/readable and cancelled when the Fabric shuts down the replica - this would seriously simplify my life. :)
I was running into the same problem. My listener was starting up before the main thread of the service. I queued the list of listeners needing to be started, and then activated them all early on in the main thread. As a result, all messages coming in were able to be handled and placed into the appropriate reliable storage. My simple solution (this is a service bus listener):
public Task<string> OpenAsync (CancellationToken cancellationToken)
{
string uri;
Start ();
uri = "<your endpoint here>";
return Task.FromResult (uri);
}
public static object lockOperations = new object ();
public static bool operationsStarted = false;
public static List<ClientAuthorizationBusCommunicationListener> pendingStarts = new List<ClientAuthorizationBusCommunicationListener> ();
public static void StartOperations ()
{
lock (lockOperations)
{
if (!operationsStarted)
{
foreach (ClientAuthorizationBusCommunicationListener listener in pendingStarts)
{
listener.DoStart ();
}
operationsStarted = true;
}
}
}
private static void QueueStart (ClientAuthorizationBusCommunicationListener listener)
{
lock (lockOperations)
{
if (operationsStarted)
{
listener.DoStart ();
}
else
{
pendingStarts.Add (listener);
}
}
}
private void Start ()
{
QueueStart (this);
}
private void DoStart ()
{
ServiceBus.WatchStatusChanges (HandleStatusMessage,
this.clientId,
out this.subscription);
}
========================
In the main thread, you call the function to start listener operations:
protected override async Task RunAsync (CancellationToken cancellationToken)
{
ClientAuthorizationBusCommunicationListener.StartOperations ();
...
This problem likely manifested itself here as the bus in question already had messages and started firing the second the listener was created. Trying to access anything in state manager was throwing the exception you were asking about.

ObjectDisposedException: Safe handle has been closed

So this is a rather small question with a big explanation. As is noted by the title I am getting an unhandled exception telling me my Safe handle has been closed. What I'll probably have to do is edit this post a few times with more and more code to help me diagnose what the problem is.
I'm using POS for .NET to make a Service Object for my RFID and MSR device. Although my devices are the same, I have 2 different Virtual COM Port chips that communicate to those devices. One by Silicon labs, the other by FTDI. I wanted to use the plug and play features with POS for .NET so I gave it both my Hardware ID's. Because it is plug and play I have the full hardware path available to me which I can then create a SafeFileHandle using a call to PInvoke and using that SafeFileHandle I create a FileStream. The FTDI chip doesn't let me talk to the devices directly like that so I have to get the friendly name of the device then use mutex to pull out the COM port then create a SerialPort instance. That step works fine and great. As a FYI I have tried to use the Friendly name of both chips to get the COM port and the Silicon Labs one (for some strange reason) doesn't get listed using SetupAPI.GetDeviceDetails using the Ports GUID. I'm not sure on that one since in Device Manager the Silicon labs Device Class Guid is the Ports GUID.
Well since both the SerialPort and the FileStream have a Stream object I decided to use that to read and write to that port. The problem with that is if I send a RFID command to the MSR device the MSR device doesn't respond back with anything. So if I use this code int fromReader = ReaderStream.ReadByte(); my thread is blocked. It's a blocking call and requires a minimum of 1 byte to proceed. So I looked around and it appears the only solution is to use a separate thread and set a timeout. If the timeout happens then abort the thread.
Thread t = new Thread(new ThreadStart(ReadFromStream));
t.Start();
if (!t.Join(timeout))
{
t.Abort();
}
(t.Abort has been surrounded with a try/catch to no avail, since it didn't fix the problem I removed it)
ReadFromStream is Abstract method in RFID Device. Here is one of the implementations
protected override void ReadFromStream()
{
var commandLength = USN3170Constants.MIN_RESPONSE_LENGTH;
var response = new System.Collections.Generic.List<byte>(USN3170Constants.MIN_RESPONSE_LENGTH);
for (int i = 0; i <= commandLength; i++)
{
int fromReader = ReaderStream.ReadByte();
if (fromReader == -1) break; //at end of stream
response.Add((byte)fromReader);
if (response.Count > USN3170Constants.DATA_LENGTH_INDEX && response[USN3170Constants.DATA_LENGTH_INDEX] > 0)
{
commandLength = response[USN3170Constants.DATA_LENGTH_INDEX] + 3;
}
}
streamBuffer = response.ToArray();
}
(int fromReader = ReaderStream.ReadByte(); was surrounded with a try/catch. Only thing it caught was the aborted thread exception, so I took it out)
The above code is where I suspect the problem lies. The strange thing is, though, is that I have a unit test which I feel mimics rather well the Microsoft Test App.
(FYI QUADPORT is the FTDI chipset)
PosExplorer posExplorer;
DeviceCollection smartCardRWs;
[Test]
public void TestQuadPortOpen()
{
posExplorer = new PosExplorer();
smartCardRWs = posExplorer.GetDevices(DeviceType.SmartCardRW, DeviceCompatibilities.CompatibilityLevel1);
//if using quadport one item is the MSR and the other is the RFID
//because of that one of them will fail. Currently the first Device in the collection is the RFID, and the second is MSR
Assert.GreaterOrEqual(smartCardRWs.Count, 2);
//Hardware Id: QUADPORT\QUAD_SERIAL_INTERFACE
foreach(DeviceInfo item in smartCardRWs)
{
Assert.AreEqual("QUADPORT\\QUAD_SERIAL_INTERFACE", item.HardwareId);
}
SmartCardRW rfidDevice = (SmartCardRW)posExplorer.CreateInstance(smartCardRWs[0]);
SmartCardRW msrDevice = (SmartCardRW)posExplorer.CreateInstance(smartCardRWs[1]);
rfidDevice.Open();
Assert.AreNotEqual(ControlState.Closed, rfidDevice.State);
rfidDevice.Close();
try
{
msrDevice.Open();
Assert.Fail("MSR Device is not a RFID Device");
}
catch
{
Assert.AreEqual(ControlState.Closed, msrDevice.State);
}
rfidDevice = null;
msrDevice = null;
}
When I run that test I do not get the SafeFileHandle exception. In fact the test passes.
So I am at a loss as to how to track down this bug. Since I'll be using this Service Object in a different program that I am also creating I'll probably end up using this code from this test in that program. However I feel that the Microsoft Test App is more or less the "Golden Standard". Is it really... probably not. But it does work good for my purposes, SO I feel it is a problem with my code and not theirs.
Any tricks on how I can narrow this down? FYI I've tried using the debugger, but when walking the Open code the error does not occur. I also walked the Update Status Timer and it also does not throw the error. Once I hit continue, I get the exception. I turned off Just My Code and loaded symbols, and it tells me "Source Information is missing from the debug information for this module"
This problem (and in particular the reference to a SerialPort instance) sounds suspiciously like the problem documented at http://connect.microsoft.com/VisualStudio/feedback/details/140018/serialport-crashes-after-disconnect-of-usb-com-port.
As I understand it, in the case of a non-permanent SerialPort (like one associated with a USB device, for example) when the port "goes away" unexpectedly the underlying Stream associated with it gets disposed. If there is an active read or write operation on the port at the time a subsequent call to SerialPort.Close can lead to the exception you mention, however the exception is occurring in Microsoft's code running on a different thread and cannot be caught from within your code. (It will still be seen by any "last chance" exception handler you have bound to the UnhandledException event on the AppDomain.)
There seem to be two basic workaround styles in the linked document. In both instances, after opening the port you store a reference to the BaseStream instance for the open port. One workaround then suppresses garbage collection on that base stream. The other explicitly calls Close on the base stream, capturing any exceptions thrown during that operation, before calling Close on the SerialPort.
EDIT: For what it's worth, under the .NET framework V4.5, it appears that none of the documented workarounds on the Microsoft Connect site fully resolve the problem although they may be reducing the frequency with which it occurs. :-(
I had the same error when I used a thread to read from a SerialPort. Calling Interrupt on the thread occasionally caused the uncatchable ObjectDisposedException. After hours of debugging and carefully reading this:
https://blogs.msdn.microsoft.com/bclteam/2006/10/10/top-5-serialport-tips-kim-hamilton/
I realized that the problem is just this:
.NET 2.0 (and above) isn’t letting you get away with some things, such as attempting to cancel a SerialPort read by interrupting the thread accessing the SerialPort.
So before you call Thread.Interrupt() you have to close the COM... This will cause a catchable exception on the ReadByte operation.
Or you may use the ReadTimeout property on the SerialPort to avoid using a thread just to have a timeout.
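The ReadTimeout suggestion can be sketched as follows. This assumes the port is already open and applies only to the SerialPort path (not to a FileStream built on a SafeFileHandle); the class SerialReads and the method TryReadByte are hypothetical names:

```csharp
using System;
using System.IO.Ports;

// Hypothetical helper: reads one byte with a timeout instead of aborting a reader thread.
public static class SerialReads
{
    // Returns the byte read, or null if nothing arrived within timeoutMs
    // or the end of the stream was reached.
    public static int? TryReadByte(SerialPort port, int timeoutMs)
    {
        port.ReadTimeout = timeoutMs;
        try
        {
            int value = port.ReadByte();
            return value == -1 ? (int?)null : value;
        }
        catch (TimeoutException)
        {
            // SerialPort.ReadByte throws TimeoutException when ReadTimeout elapses.
            return null;
        }
    }
}
```

With this shape the read loop runs on the calling thread, so neither Thread.Abort nor Thread.Interrupt is needed when a device stays silent.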
I would like to post my case in which I had a similar issue trying to read from a serial port (virtual com driven by a Moxa RS232 to ethernet).
Since I did have no chance to catch the ObjectDisposedException, the only solution was to increase the ReadTimeout property which was originally set to -1 (continuous reading).
Setting the ReadTimeout to 100 millis solved this issue in my case.
EDIT
It is not the definitive solution: it can happen that if you close the application during a read attempt you can get the same uncatchable exception.
My final solution is to kill the process of the application directly in the FormClosing event :
private void MyForm_FormClosing(object sender, FormClosingEventArgs e)
{
Process p = Process.GetCurrentProcess();
p.Kill();
}
Please take a look at this:
https://github.com/jcurl/SerialPortStream
I replaced System.IO.Ports with RJCP.IO.Ports, fixed up a couple of parameter differences in the initialization, and all the problems with this issue went away.

Debugging/profiling/optimizing C# Windows service in VS 2012

I am creating a Windows service in C#. Its purpose is to consume info from a feed on the Internet. I get the data by using zeromq's pub/sub architecture (my service is a subscriber only). To debug the service I "host" it in a WPF control panel. This allows me to start, run, and stop the service without having to install it. The problem I am seeing is that when I call my stop method it appears as though the service continues to write to the database. I know this because I put a Debug.WriteLine() where the writing occurs.
More info on the service:
I am attempting to construct my service in a fashion that allows it to write to the database asynchronously. This is accomplished by using a combination of threads and the ThreadPool.
public void StartDataReceiver() // Entry point to service from WPF host
{
// setup zmq subscriber socket
receiverThread = new Thread(SpawnReceivers);
receiverThread.Start();
}
internal void SpawnReceivers()
{
while(!stopEvent.WaitOne(0))
{
ThreadPool.QueueUserWorkItem(new WaitCallback(ProcessReceivedData), subscriber.Recv()); // subscriber.Recv() blocks when there is no data to receive (according to the zmq docs) so this loop should remain under control, and threads only created in the pool when there is data to process.
}
}
internal void ProcessReceivedData(Object recvdData)
{
// cast recvdData from object -> byte[]
// convert byte[] -> JSON string
// deserialize JSON -> MyData
using (MyDataEntities context = new MyDataEntities())
{
// build up EF model object
Debug.WriteLine("Write obj to db...");
context.MyDatas.Add(myEFModel);
context.SaveChanges();
}
}
internal void QData(Object recvdData)
{
Debug.WriteLine("Queued obj in queue...");
q.Enqueue((byte[])recvdData);
}
public void StopDataReceiver()
{
stopEvent.Set();
receiverThread.Join();
subscriber.Dispose();
zmqContext.Dispose();
stopEvent.Reset();
}
The code above contains the methods that I am concerned with. When I debug the WPF host and the method ProcessReceivedData is set to be queued in the thread pool, everything seems to work as expected until I stop the service by calling StopDataReceiver. As far as I can tell the thread pool never queues any more work items (I checked this by placing a breakpoint on that line), but I continue to see "Write obj to db..." in the output window, and when I 'Break All' in the debugger a little green arrow appears on the context.SaveChanges(); line, indicating that is where execution is currently halted. When I test some more and have the thread pool queue up the method QData instead, everything seems to work as expected: I see "Queued obj in queue..." messages in the output window until I stop the service. Once I do, no more messages appear in the output window.
TL;DR:
I don't know how to determine if the Entity Framework is just slowing things way down and the messages I am seeing are just the thread pool clearing its backlog of work items, or if there is something larger at play. How do I go about solving something like this?
Would a better solution be to queue the incoming JSON strings as byte[], like I do in the QData method, and then have the thread pool queue up a different method to work on clearing the queue? I feel that this solution will only shift the problem around and not actually solve it.
Could another solution be to write a new service dedicated to clearing that queue? The problem I see with writing another service would be that I would probably have to use WCF (or possibly zmq) to communicate between the two services which would obviously add overhead and possibly become less performant.
I see the critical section in all of this being the part of getting the data off the wire fast enough because the publisher I am subscribed to is set to begin discarding messages if my subscriber can't keep up.
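The queue-then-drain idea raised above can be sketched with a bounded BlockingCollection and one dedicated consumer. This is an illustration under assumptions, not a tested fix for the service: the class QueuedWriter and its members are hypothetical, and the write delegate stands in for the Entity Framework code in ProcessReceivedData. Work items already handed to the thread pool keep running after stopEvent is set, which would be consistent with the writes seen after StopDataReceiver; here Stop() makes that backlog explicit by waiting for it:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: a single consumer drains a bounded queue of received messages.
public sealed class QueuedWriter : IDisposable
{
    private readonly BlockingCollection<byte[]> _queue;
    private readonly Task _consumer;

    public QueuedWriter(Action<byte[]> write, int capacity)
    {
        _queue = new BlockingCollection<byte[]>(capacity);
        // LongRunning hints that this loop should get its own thread, not a pool worker.
        _consumer = Task.Factory.StartNew(() =>
        {
            foreach (byte[] item in _queue.GetConsumingEnumerable())
            {
                write(item);
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Blocks the caller when the queue is full (back-pressure on the receiver).
    public void Enqueue(byte[] data)
    {
        _queue.Add(data);
    }

    // Stops accepting new items and returns once everything queued has been written.
    public void Stop()
    {
        _queue.CompleteAdding();
        _consumer.Wait();
    }

    public void Dispose()
    {
        _queue.Dispose();
    }
}
```

The capacity is a trade-off: because Enqueue blocks when the queue is full, a capacity that is too small would slow the receive loop and could let the publisher start discarding messages.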

Is there a way to get the stacktraces for all threads in c#, like java.lang.Thread.getAllStackTraces()?

In java it is possible to get a snapshot of the stacktraces of all running threads.
This is done with java.lang.Thread.getAllStackTraces() (it returns Map<Thread,StackTraceElement[]>).
How can this be done with .net?
So I actually just had to figure out how to do this. I haven't used this solution extensively in production yet, but there's a relatively new library called ClrMD.
http://blogs.msdn.com/b/dougste/archive/2013/05/04/clrmd-net-crash-dump-and-live-process-inspection.aspx
Using it, I'm able to attach to my own process and get a stack trace for all live threads. I use this when a deadlock is detected, before restarting our app, like so:
var result = new Dictionary<int, string[]>();
var pid = Process.GetCurrentProcess().Id;
using (var dataTarget = DataTarget.AttachToProcess(pid, 5000, AttachFlag.Passive))
{
ClrInfo runtimeInfo = dataTarget.ClrVersions[0];
var runtime = runtimeInfo.CreateRuntime();
foreach (var t in runtime.Threads)
{
result.Add(
t.ManagedThreadId,
t.StackTrace.Select(f =>
{
if (f.Method != null)
{
return f.Method.Type.Name + "." + f.Method.Name;
}
return null;
}).ToArray()
);
}
}
var json = JsonConvert.SerializeObject(result);
zip.AddEntry("_threads.json", json);
The really important thing to get that to work from the same process is AttachFlag.Passive
If you just do DataTarget.AttachToProcess(pid, 5000), it'll do an "invasive" attach which attempts to pause the process. This throws an exception when you try to attach to your own process, I'm assuming because you can't pause your application while trying to attach from your application or something like that.
If you want to get stack traces of all the threads within managed code then you could try mdbg. Have a look at Managed Stack Explorer it does use mdbg and gets stacks of all the threads.
If you want this for debugging purposes alone, the SOS extensions to WinDbg can give you this information.
The command to run is "~*e !clrstack".
Inside of a running C# program, there is no public way to enumerate managed threads or look them up by ID. Even if you could, getting a stack trace on a different thread would likely require it to be suspended, which has some risks of side effects (see why this is obsolete).
The other alternative is to enlist threads as they are known, and scan them at your leisure. This is probably only possible if you're explicitly creating thread objects rather than using the thread pool.
That said, it is also hard for me to see what purpose this approach would serve. If it is for debugging, there are far more powerful techniques that can be done in-memory or on mini-dumps. If it is for logging, then it might make sense to have logging calls contribute their own stacks.
Updated code to get a snapshot of all stack traces, using the answer from @Joshua Evensen as a base. You'll still need to install the NuGet package CLR Memory Diagnostics (ClrMD). This snippet also includes extra code to get the thread names, but this isn't required if you just want the stack traces.
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using Microsoft.Diagnostics.Runtime;
namespace CSharpUtils.wrc.utils.debugging
{
public static class StackTraceAnalysis
{
public static string GetAllStackTraces()
{
var result = new StringBuilder();
using (var target = DataTarget.CreateSnapshotAndAttach(Process.GetCurrentProcess().Id))
{
var runtime = target.ClrVersions.First().CreateRuntime();
// We can't get the thread name from the ClrThread objects, so we'll look for
// Thread instances on the heap and get the names from those.
var threadNameLookup = new Dictionary<int, string>();
foreach (var obj in runtime.Heap.EnumerateObjects())
{
if (!(obj.Type is null) && obj.Type.Name == "System.Threading.Thread")
{
var threadId = obj.ReadField<int>("m_ManagedThreadId");
var threadName = obj.ReadStringField("m_Name");
threadNameLookup[threadId] = threadName;
}
}
foreach (var thread in runtime.Threads)
{
threadNameLookup.TryGetValue(thread.ManagedThreadId, out string threadName);
result.AppendLine(
$"ManagedThreadId: {thread.ManagedThreadId}, Name: {threadName}, OSThreadId: {thread.OSThreadId}, Thread: IsAlive: {thread.IsAlive}, IsBackground: {thread.IsBackground}");
foreach (var clrStackFrame in thread.EnumerateStackTrace())
result.AppendLine($"{clrStackFrame.Method}");
}
}
return result.ToString();
}
}
}
You can use ProcInsp, which has a web API to get threads with their stacks in JSON. The web API is available at /Process/%PID%/Threads (use a GET request).
Disclaimer: I'm the developer of ProcInsp. The tool is under the MIT licence and is free for use.
As Mason of Words suggests, this doesn't look possible from within the managed code itself.
Could you clarify why you need this: there might be a better solution?
For example, if you attach to the process in Visual Studio and press "pause", then the "Threads" window will list all managed threads, and the "Stacktrace" window can show the current stack trace for each thread. Would that suffice?
There is a StackTrace class
var trace = new System.Diagnostics.StackTrace(exception);
http://msdn.microsoft.com/en-us/library/system.diagnostics.stacktrace.aspx
You can loop on System.Diagnostics.Process.GetCurrentProcess().Threads and for each Thread create a StackTrace object with the .ctor that takes a Thread as its param.
