I have an error condition in my connection handling code. This error causes my server to freeze, but RANDOMLY, on some requests. It's important the requests are identical. Let's say 25% of them causes the server to freeze.
Here's the catch: the server freezes when and only when the first TEST request is made. On infinite number of subsequent TEST requests it won't fail... It's very important: it happens only when testing with my unit test.
My test just runs DirectoryEntry query. It creates n DirectoryEntry objects, queries them, then calls Close on each one, then repeats the procedure m times.
If it fails, it fails on first query.
BUT THEN if I connect real LDAP client like LdapAdmin, it behaves completely different! The chance the server will crash is very high, over 50%. Obviously, the real client does something differently while connecting, and obviously, even my C# test code also does that crashing thing, but only for the first time. My subsequent test queries do not test the server at all. They have zero efficiency against detecting "client connecting" freeze.
Now I test it by manually running my test "exe" file several times. Probability of catching the freeze is nearly 100% after 8 tries. Now I want my unit test to do that for me - do the offending action let's say 8 times and fail if it freezes once.
Unfortunately - the state of connecting client is different on application start and on subsequent calls. My question is WHAT EXACTLY is different? I tested it's not the server under the test. I can crash it anytime by connecting to it from external client. First attempt - success, client connected, client can try to DDOS my server, good luck, it doesn't happen. But then I try to connect again, server dies. I try to simulate this behavior in my test, but I fail. Either I detect crash on first attempt, or I get infinite number of successful connections. I need a way to truly reset my test client, so they would behave like a new client app connecting to my server for the first time. Close and Dispose methods of DirectoryEntry do not achieve that goal, or something else is involved.
No code sample, because DirectoryEntry query is a one-liner. My test should just execute such query, but should also disconnect afterwards forgetting it ever connected to the server. I want each DirectoryEntry query be as fresh and first as when the application is started.
BTW, I already found the cause of this particular freeze, now I want such case to be detectable by my unit test to avoid similar issues when the code is changed.
Another weird finding: When I try to connect 10k clients to my server - it passes... or fails on the first attempt.
Also tried:
I waited for Disposed event of DirectoryEntry objects. Didn't help. Still no joy. To catch the freeze I must run the same test several times.
Tried to dispose all objects including the tested server and recreate them multiple times during the test. Still no joy. Either fails on start, or doesn't fail no matter how many times this test is iterated.
Related
I am using a Lab View application to simulate a test running, which would post a JSON string to my ASP.NET application. Within the ASP.NET application I format the data with the proper partition and row keys, then send it to Azure Table Storage.
The problem that I am having is that after what seems like a random amount of time (i.e. 5 minutes, 2 hours, 5 hours), the data fails to be saved into Azure. I am try to catch any exceptions within the ASP.NET application and send the error message back to the Lab View app and the Lab View app is also catching any exceptions in may encounter so I can trouble shoot where the issue is occurring.
The only error that I am able to catch is a Timeout Error 56 in the Lab View program. My question is, does anyone have an idea of where I should be looking for the root cause of this? I do not know where to begin.
EDIT:
I am using a table storage writer that I found here to do batch operations with retries.
The constructor for exponential retry policy is below:
public ExponentialRetry(TimeSpan deltaBackoff, int maxAttempts)
when you (or the library you use to be exact) instantiate this as RetryPolicy = new ExponentialRetry(TimeSpan.FromMilliseconds(2),100) you are basically setting the max attempts as 100 which means you may end up waiting up to around 2^100 milliseconds (there is some more math behind this but just simplifying) for each of your individual batch requests to fail on the client side until the sdk gives up retrying.
The other issue with that code is it executes batch requests sequentially and synchronously, that has multiple bad effects, first, all subsequent batch requests are blocked by the current batch request, second your cores are blocked waiting on I/O operations, third it has no exception handling so if one of the batch operations throw an exception, the method bails out and would not continue any further processing other batch requests.
My recommendation, do not use that library, batch operations are fairly straight forward. The default retry policy if you do not explicitly define is the exponential retry policy anyways with sensible default parameters (does 3 retries) so you do not even need to define your own retry object. For best scalability and throughput run your batch operations async (and concurrently).
As to why things fail, when you write your own api, catch the StorageException and check the http status code on the exception itself. You could be getting throttled by azure as one of the possibilities but it is hard to say without further debugging or you providing the http status code for the failed batch operations to us.
You need to check whether an exception is transient or not. As Peter said on his comment, Azure Storage client already implements a retry policy. You can also wrap your code with another retry code (e.g using polly) or you should change the default policy associated to Azure Storage Client.
I'm trying to add a member to a Group using SignalR 2.2. Every single time, I hit a 30 second timeout and get a "System.Threading.Tasks.TaskCanceledException: A task was canceled." error.
From a GroupSubscriptionController that I've written, I'm calling:
var hubContext = GlobalHost.ConnectionManager.GetHubContext<ProjectHub>();
await hubContext.Groups.Add(connectionId, groupName);
I've found this issue where people are periodically encountering this, but it happens to me every single time. I'm running the backend (ASP.NET 4.5) on one VS2015 launched localhost port, and the frontend (AngularJS SPA) on another VS 2015 launched localhost port.
I had gotten SignalR working to the point where messages were being broadcast to every connected client. It seemed so easy. Now, adding in the Groups part (so that people only get select messages from the server) has me pulling my hair out...
That task cancellation error could be being thrown because the connectionId can't be found in the SignalR registry of connected clients.
How are you getting this connectionId? You have multiple servers/ports going - is it possible that you're getting your wires crossed?
I know there is an accepted answer to this, but I came across this once for a different reason.
First off, do you know what Groups.Add does?
I had expected Groups.Add's task to complete almost immediately every time, but not so. Groups.Add returns a task that only completes, when the client (i.e. Javascript) acknowledges that it has been added to a group - this is useful for reconnecting so it can resubscribe to all its old groups. Note this acknowledgement is not visible to the developer code and nicely covered up for you.
The problem is that the client may not respond because they have disconnected (i.e. they've navigated to another page). This will mean that the await call will have to wait until the connection has disconnected (default timeout 30 seconds) before giving up by throwing a TaskCanceledException.
See http://www.asp.net/signalr/overview/guide-to-the-api/working-with-groups for more detail on groups
We have an issue in our Rebus/RabbitMQ setup where Rebus suddenly stops retrieving/handling messages from RabbitMQ. This has happened two times in the last month and we're not really sure how to proceed.
Our RabbitMQ setup has two nodes on different servers, and the Rebus side is a windows service.
We see no errors in Rebus or in the eventlog on the server where Rebus runs. We also do not see errors on the RabbitMQ servers.
Rebus (and the windows service) keeps running as we do see other log messages, like the DueTimeOutSchedular and timeoutreplies. However it seems the worker thread stops running, but without any errors being logged.
It results in a RabbitMQ input queue that keeps growing :(, we're adding logging to monitor this so we get notified if it happens again.
But I'm looking for advise on how to continue the "investigation" and ideas on how to prevent this. Maybe some of you have experienced this before?
UPDATE
It seems that we actually did have a node crashing, at least the last time it happened. The master RabbitMQ node crashed (the server crashed) and the slave was promoted to master. As far as I can see from the RabbitMQ logs on the nodes everything went according to planned. There are no other errors in the RabbitMQ logs.
At the time this happened Rebus was configured to connect only to the node that was the slave (then promoted to master) so Rebus did not experience the rabbitmq failure and thus no Rebus connection errors. However, it seems that Rebus stopped handling messages when the failure occurred.
We are actually experiencing this on a few queues it seems, and some of them, but not all seems to have ended up in an unsynchronized state.
UPDATE 2
I was able to reproduce the problem quite easily, so it might be a configuration issue in our setup. But this is what we do to reproduce it
Start two nodes in a cluster, ex. rabbit1 (master) and rabbit2 (slave)
Rebus connects to rabbit2, the slave
Close rabbit1, the master. rabbit2 is promoted to master
The queues are mirrored
We have two small tests apps to reproduce this, a "sender" that sends a message every second and a "consumer" that handles the messages.
When rabbit1 is closed, the "consumer" stops handling messages, but the "sender" keeps sending the messages and the queue keeps growing.
Start rabbit1 again, it joins as slave
This has no effect and the "consumer" still does not handle messages.
Restart the "consumer" app
When the "consumer" is restarted it retrieves all the messages and handles them.
I think I have followed the setup guides correctly, but it might be a configuration issue on our part. I can't seem to find anything that would suggest what we have done wrong.
Rebus is still connected to RabbitMQ, we see that in the connections tab on the management site, the "consumers" send/recieved B/s drop to about 2 B/s when it stops handling messages
UPDATE 3
Ok so I downloaded the Rebus source and attached to our process so I could see what happens in the "RabbitMqMessageQueue" class when it stops. When "rabbit1* is closed the "BasicDeliverEventArgs" is null, this is the code
BasicDeliverEventArgs ea;
if (!threadBoundSubscription.Next((int)BackoffTime.TotalMilliseconds, out ea))
{
return null;
}
// wtf??
if (ea == null)
{
return null;
}
See: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L178
I like the "wtf ??" comment :)
That sounds very weird!
Whenever Rebus' RabbitMQ transport experiences an error on the connection, it will throw out the connection, wait a few seconds, and ensure that the connection is re-established again when it can.
You can see the relevant place in the source here: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L205
So I guess the question is whether the RabbitMQ client library can somehow enter a faulted state, silently, without throwing an exception when Rebus attemps to get the next message...?
When you experienced the error, did you check out the 'connections' tab in RabbitMQ management UI and see if the client was still connected?
Update:
Thanks for you thorough investigation :)
The "wtf??" is in there because I once experienced a hiccup when ea had apparently been null, which was unexpected at the time, thus causing a NullReferenceException later on and the vomiting of exceptions all over my logs.
According to the docs, Next will return true and set the result to null when it reaches "end-of-stream", which is apparently what happens when the underlying model is closed.
The correct behavior in that case for Rebus would be to throw a proper exception and let the connection be re-established - I'll implement that right away!
Sit tight, I'll have a fix ready for you in a few minutes!
I have an SQL Server (2008 R2) based (C# WinForms) application that predominantly runs on a local machine using a local installation of SQL Server 2008 R2. One problem I have is that if the user does not have a server instance running and tries to execute some commands or perform some operations, the queries are sent off to SQL Server and it takes an age to throw an SqlException telling me the requested instance is not started.
I have read the following question and associated answers, but these solutions are far from ideal. WMI seem very much over-kill and I do not want to have to include extra .dlls in my installation package for the software if it can be avoided.
I have also come accross the SqlDataSourceEnumerator Class documented here
// Retrieve the enumerator instance and then the data.
SqlDataSourceEnumerator instance = SqlDataSourceEnumerator.Instance;
System.Data.DataTable table = instance.GetDataSources();
which dumps the available connection into a DataTable. However, there seems to be inherent problems with returning all the available connections:
"All of the available servers may or may not be listed. The list can vary depending on
factors such as timeouts and network traffic. This can cause the list to be different
on two consecutive calls." - MSDN.
There has to be a set way of dealing with this problem. Say I have the following SqlConnection string:
Data Source=localhost;Initial Catalog=MyDB;Integrated Security=True;Connection Timeout = 0
what can I use as an efficient (this is crucial) check as to whether the selected instance ('localhost' [the default instance] or 'SomeInstanceName') is running?
Thanks for your time.
I don't think you need to worry about timeouts or network issues when the server and client are the same machine. Just attempting to connect is about efficient as you're going to get, the crucial part is going to be how long do you let the connection attempt try before you give up (connection timeout). You can shorten that window obviously, but if you make it too short, then the problem doesn't really make sense.
You can change the connection timeout to be a shorter period, but essentially, the only way it knows that a server isn't there, is from a timeout.
Any technique you use will likely have the exact same timeout issue.
If you know the instance name, such as "MSSQL$InstanceName" you can use the System.ServiceProcess.ServiceController class to get a list of all services on the machine and then loop through looking to see if any ServiceName == MSSQL$InstanceName.
I have found this to be very fast plus you can check to see if it is running and start it if it is not running.
You could try opening up a simple TCP connection to the standard SQL port and see if it sticks .
set the connection timeout to some reasonably low value based on your environment.
see
http://support.microsoft.com/kb/287932
for sql serevr port numbers.
Every now and then in a high volume .NET application, you might see this exception when you try to execute a query:
System.Data.SqlClient.SqlException: A transport-level error has
occurred when sending the request to the server.
According to my research, this is something that "just happens" and not much can be done to prevent it. It does not happen as a result of a bad query, and generally cannot be duplicated. It just crops up maybe once every few days in a busy OLTP system when the TCP connection to the database goes bad for some reason.
I am forced to detect this error by parsing the exception message, and then retrying the entire operation from scratch, to include using a new connection. None of that is pretty.
Anybody have any alternate solutions?
I posted an answer on another question on another topic that might have some use here. That answer involved SMB connections, not SQL. However it was identical in that it involved a low-level transport error.
What we found was that in a heavy load situation, it was fairly easy for the remote server to time out connections at the TCP layer simply because the server was busy. Part of the reason was the defaults for how many times TCP will retransmit data on Windows weren't appropriate for our situation.
Take a look at the registry settings for tuning TCP/IP on Windows. In particular you want to look at TcpMaxDataRetransmissions and maybe TcpMaxConnectRetransmissions. These default to 5 and 2 respectively, try upping them a little bit on the client system and duplicate the load situation.
Don't go crazy! TCP doubles the timeout with each successive retransmission, so the timeout behavior for bad connections can go exponential on you if you increase these too much. As I recall upping TcpMaxDataRetransmissions to 6 or 7 solved our problem in the vast majority of cases.
This blog post by Michael Aspengren explains the error message "A transport-level error has occurred when sending the request to the server."
To answer your original question:
A more elegant way to detect this particular error, without parsing the error message, is to inspect the Number property of the SqlException.
(This actually returns the error number from the first SqlError in the Errors collection, but in your case the transport error should be the only one in the collection.)
I had the same problem albeit it was with service requests to a SQL DB.
This is what I had in my service error log:
System.Data.SqlClient.SqlException: A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
I have a C# test suite that tests a service. The service and DB were both on external servers so I thought that might be the issue. So I deployed the service and DB locally to no avail. The issue continued. The test suite isn't even a hard pressing performance test at all, so I had no idea what was happening. The same test was failing each time, but when I disabled that test, another one would fail continuously.
I tried other methods suggested on the Internet that didn't work either:
Increase the registry values of TcpMaxDataRetransmissions and TcpMaxConnectRetransmissions.
Disable the "Shared Memory" option within SQL Server Configuration Manager under "Client Protocols" and sort TCP/IP to 1st in the list.
This might occur when you are testing scalability with a large number of client connection attempts. To resolve this issue, use the regedit.exe utility to add a new DWORD value named SynAttackProtect to the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ with value data of 00000000.
My last resort was to use the old age saying "Try and try again". So I have nested try-catch statements to ensure that if the TCP/IP connection is lost in the lower communications protocol that it does't just give up there but tries again. This is now working for me, however it's not a very elegant solution.
use Enterprise Services with transactional components
I have seen this happen in my own environment a number of times. The client application in this case is installed on many machines. Some of those machines happen to be laptops people were leaving the application open disconnecting it and then plugging it back in and attempting to use it. This will then cause the error you have mentioned.
My first point would be to look at the network and ensure that servers aren't on DHCP and renewing IP Addresses causing this error. If that isn't the case then you have to start trawlling through your event logs looking for other network related.
Unfortunately it is as stated above a network error. The main thing you can do is just monitor the connections using a tool like netmon and work back from there.
Good Luck.
You should also check hardware connectivity to the database.
Perhaps this thread will be helpful:
http://channel9.msdn.com/forums/TechOff/234271-Conenction-forcibly-closed-SQL-2005/
I'm using reliability layer around my DB commands (abstracted away in the repository interfaece). Basically that's just code that intercepts any expected exception (DbException and also InvalidOperationException, that happens to get thrown on connectivity issues), logs it, captures statistics and retries everything again.
With that reliability layer present, the service has been able to survive stress-testing gracefully (constant dead-locks, network failures etc). Production is far less hostile than that.
PS: There is more on that here (along with a simple way to define reliability with the interception DSL)
I had the same problem. I asked my network geek friends, and all said what people have replied here: Its the connection between the computer and the database server. In my case it was my Internet Service Provider, or there router that was the problem. After a Router update, the problem went away. But do you have any other drop-outs of internet connection from you're computer or server? I had...
I experienced the transport error this morning in SSMS while connected to SQL 2008 R2 Express.
I was trying to import a CSV with \r\n. I coded my row terminator for 0x0d0x0a. When I changed it to 0x0a, the error stopped. I can change it back and forth and watch it happen/not happen.
BULK INSERT #t1 FROM 'C:\123\Import123.csv' WITH
( FIRSTROW = 1, FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0d0x0a' )
I suspect I am not writing my row terminator correctly because SQL parses one character at a time right while I'm trying to pass two characters.
Anyhow, this error is 4 years old now, but it may provide a bit of information for the next user.
I just wanted to post a fix here that worked for our company on new software we've installed. We were getting the following error since day 1 on the client log file: Server was unable to process request. ---> A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.) ---> The semaphore timeout period has expired.
What completely fixed the problem was to set up a link aggregate (LAG) on our switch. Our Dell FX1 server has redundant fiber lines coming out of the back of it. We did not realize that the switch they're plugged into needed to have a LAG configured on those two ports. See details here: https://docs.meraki.com/display/MS/Switch+Ports#SwitchPorts-LinkAggregation