Couchbase NodeUnavailableException in .NET SDK - c#

We are encountering this exception very often in our production code, without any increase in the number of requests to Couchbase or any memory pressure on the server itself.
The node has 30GB of RAM allocated and uses at most 3GB, yet every now and then this exception is thrown. The bucket is opened only once per application lifetime, and only get and upsert operations are performed afterwards. The connection is initialised like this:
Config = new ClientConfiguration()
{
    Servers = serverList,
    UseSsl = false,
    DefaultOperationLifespan = 2500,
    BucketConfigs = new Dictionary<string, BucketConfiguration>
    {
        {
            bucketName, new BucketConfiguration
            {
                BucketName = bucketName,
                UseSsl = false,
                DefaultOperationLifespan = 2500,
                PoolConfiguration = new PoolConfiguration
                {
                    MaxSize = 2000,
                    MinSize = 200,
                    SendTimeout = (int)Configuration.Config.Instance.CouchbaseConfig.Timeout
                }
            }
        }
    }
};
Cluster = new Cluster(Config);
Bucket = Cluster.OpenBucket();
Can you please let me know whether this initialisation is correct and, more importantly, what to check on the Couchbase server to find the cause of this issue? I have checked all the logs on the server but could not find anything unusual at the time these errors are thrown.
Thank you,
Stacktrace:
System.Exception: Couchbase exception
at ###.DataLayer.Couchbase.CouchbaseUserOperations.Get()
at ###.API.Services.BaseService`1.SetUserID()
at ###.API.Services.EventsService+<GetResponse>d__0.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start()
at ###.API.Services.EventsService.GetResponse()
at ###.API.Services.BaseService`1+<Any>d__28.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start()
at ###.API.Services.BaseService`1.Any()
at lambda_method()
at ServiceStack.Host.ServiceRunner`1.Execute()
at ServiceStack.Host.ServiceRunner`1.Process()
at ServiceStack.Host.ServiceExec`1.Execute()
at ServiceStack.Host.ServiceRequestExec`2.Execute()
at ServiceStack.Host.ServiceController.ManagedServiceExec()
at ServiceStack.Host.ServiceController+<>c__DisplayClass11.<RegisterServiceExecutor>b__f()
at ServiceStack.Host.ServiceController.Execute()
at ServiceStack.HostContext.ExecuteService()
at ServiceStack.Host.RestHandler.ProcessRequestAsync()
at ServiceStack.Host.Handlers.HttpAsyncTaskHandler.System.Web.IHttpAsyncHandler.BeginProcessRequest()
at System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep()
at System.Web.HttpApplication+PipelineStepManager.ResumeSteps()
at System.Web.HttpApplication.BeginProcessRequestNotification()
at System.Web.HttpRuntime.ProcessRequestNotificationPrivate()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification()
at System.Web.Hosting.UnsafeIISMethods.MgdIndicateCompletion()
at System.Web.Hosting.UnsafeIISMethods.MgdIndicateCompletion()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification()
Caused by: System.Exception : Couchbase.Core.NodeUnavailableException: The node 172.31.34.105:11210 that the key was mapped to is either down or unreachable. The SDK will continue to try to connect every 1000ms. Until it can connect every operation routed to it will fail with this exception.
at ###.DataLayer.Couchbase.CouchbaseUserOperations.Get()
at ###.API.Services.BaseService`1.SetUserID()
at ###.API.Services.EventsService+<GetResponse>d__0.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start()
at ###.API.Services.EventsService.GetResponse()
at ###.API.Services.BaseService`1+<Any>d__28.MoveNext()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start()
at ###.API.Services.BaseService`1.Any()
at lambda_method()
at ServiceStack.Host.ServiceRunner`1.Execute()
at ServiceStack.Host.ServiceRunner`1.Process()
at ServiceStack.Host.ServiceExec`1.Execute()
at ServiceStack.Host.ServiceRequestExec`2.Execute()
at ServiceStack.Host.ServiceController.ManagedServiceExec()
at ServiceStack.Host.ServiceController+<>c__DisplayClass11.<RegisterServiceExecutor>b__f()
at ServiceStack.Host.ServiceController.Execute()
at ServiceStack.HostContext.ExecuteService()
at ServiceStack.Host.RestHandler.ProcessRequestAsync()
at ServiceStack.Host.Handlers.HttpAsyncTaskHandler.System.Web.IHttpAsyncHandler.BeginProcessRequest()
at System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep()
at System.Web.HttpApplication+PipelineStepManager.ResumeSteps()
at System.Web.HttpApplication.BeginProcessRequestNotification()
at System.Web.HttpRuntime.ProcessRequestNotificationPrivate()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification()
at System.Web.Hosting.UnsafeIISMethods.MgdIndicateCompletion()
at System.Web.Hosting.UnsafeIISMethods.MgdIndicateCompletion()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper()
at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification()

A NodeUnavailableException can be returned for any number of network-related issues. However, since you mentioned you are running on AWS, it's likely that the TCP keep-alive settings need to be tuned on the client.
Your MinSize of 200 connections is so large that you are likely not using them all; they sit idle until the AWS load balancer decides to shut them down. When this happens, the SDK will temporarily (for 1000ms) put the failed node into a down state and then try to reconnect. During this time, any keys mapped to it will fail with that exception.
This blog describes how to set the TCP keep-alive time and interval: http://blog.couchbase.com/introducing-couchbase-.net-sdk-2.1.0-the-asynchronous-couchbase-.net-client
var config = new ClientConfiguration
{
    EnableTcpKeepAlives = true,    // default is true
    TcpKeepAliveTime = 1000*60*60, // set to 60 mins
    TcpKeepAliveInterval = 5000    // keep-alives will be sent every 5 seconds after 1hr of inactivity
};
var cluster = new Cluster(config);
var bucket = cluster.OpenBucket();
That assumes you are using version 2.1.0 or greater of the client. If you are not, you can do it through the ServicePointManager:
//setting keep-alive time to 200 seconds
ServicePointManager.SetTcpKeepAlive(true, 200000, 1000);
You'll have to set that to a value less than what the AWS LB uses (I believe it's 60 seconds).
You should also probably set your connection pool min and max a bit lower, like 5 and 10.
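If the short outages are acceptable but the failed requests are not, operations can also be retried once the ~1000ms down window has passed. A minimal sketch, assuming the 2.x API where Get returns an IOperationResult; the helper below is hypothetical and not part of the SDK:
using System;
using System.Threading;
using Couchbase;
using Couchbase.Core;

public static class CouchbaseRetry
{
    public static T GetWithRetry<T>(IBucket bucket, string key, int attempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            var result = bucket.Get<T>(key);
            if (result.Success)
            {
                return result.Value;
            }

            // While the node is marked down, the failure surfaces as NodeUnavailableException.
            if (result.Exception is NodeUnavailableException && attempt < attempts)
            {
                Thread.Sleep(1000); // roughly the SDK's reconnect interval
                continue;
            }

            throw result.Exception ?? new Exception(result.Message);
        }
    }
}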

Even though the problem was not fully solved (we still encounter timeouts, but at a lower rate), we improved performance by using the ClusterHelper singleton as follows:
ClusterHelper.Initialize(
    new ClientConfiguration
    {
        Servers = serverList,
        UseSsl = false,
        DefaultOperationLifespan = 2500,
        EnableTcpKeepAlives = true,
        TcpKeepAliveTime = 1000*60*60,
        TcpKeepAliveInterval = 5000,
        BucketConfigs = new Dictionary<string, BucketConfiguration>
        {
            {
                "default",
                new BucketConfiguration
                {
                    BucketName = "default",
                    UseSsl = false,
                    Password = "",
                    PoolConfiguration = new PoolConfiguration
                    {
                        MaxSize = 50,
                        MinSize = 10
                    }
                }
            }
        }
    });
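With ClusterHelper in place, buckets are then obtained through the helper rather than from a Cluster field; a short usage sketch (bucket name "default" as above, key name is a placeholder):
// ClusterHelper caches bucket references, so GetBucket is cheap to call per request.
var bucket = ClusterHelper.GetBucket("default");
var result = bucket.Get<string>("some-key");

// Release the cluster and cached buckets once, on application shutdown.
ClusterHelper.Close();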

Related

AWSSDK s3 .net question. My multipart upload crashes after 15mins of transfer

I'm uploading rather a lot of data (30GB+) across thousands of files. The whole process takes a while, but I've been finding that, consistently after 15 minutes of transfers, the upload process fails and I get errors for each file currently being transferred (I'm doing it multithreaded, so there are multiple uploads at once). The error I'm getting is: "Amazon.S3.AmazonS3Exception: The difference between the request time and the current time is too large. ---> Amazon.Runtime.Internal.HttpErrorResponseException: The remote server returned an error: (403) Forbidden. ---> System.Net.WebException: The remote server returned an error: (403) Forbidden."
Seeing as it's exactly 15 minutes from the start of the whole process that this thing crashes, I think maybe the client is timing out; however, I've set my client's timeout to 45 minutes, I think:
{
    var client = new AmazonS3Client(new AmazonS3Config()
    {
        RegionEndpoint = RegionEndpoint.EUWest2,
        UseAccelerateEndpoint = true,
        Timeout = TimeSpan.FromMinutes(45),
        ReadWriteTimeout = TimeSpan.FromMinutes(45),
        RetryMode = RequestRetryMode.Standard,
        MaxErrorRetry = 10
    });

    Parallel.ForEach(srcObjList, async srcObj =>
    {
        try
        {
            var putObjectRequest = new PutObjectRequest();
            putObjectRequest.BucketName = destBucket;
            putObjectRequest.Key = srcObj.Key;
            putObjectRequest.FilePath = filePathString;
            putObjectRequest.CannedACL = S3CannedACL.PublicRead;

            var uploadTask = client.PutObjectAsync(putObjectRequest);
            lock (threadLock)
            {
                syncTasks.Add(uploadTask);
            }
            await uploadTask;
        }
        catch (Exception e)
        {
            Debug.LogError($"Copy task ({srcObj.Key}) failed with error: {e}");
            throw;
        }
    });

    try
    {
        await Task.WhenAll(syncTasks.Where(x => x != null).ToArray());
    }
    catch (Exception e)
    {
        Debug.LogError($"Upload encountered an issue: {e}");
    }
});
await transferOperations;
Debug.Log("Done!");
The documentation doesn't specify a maximum timeout value, but given that you're seeing exactly 15 minutes, it stands to reason there is some upper limit on the timeout, either a hard limit or something in the S3 bucket's settings.
This answer suggests a clock synchronization difference might also be the cause, but then I'd wonder why the transfer starts at all.
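If clock drift does turn out to be the cause, the AWS SDK for .NET can compensate for it; a small sketch, assuming your SDK version exposes the AWSConfigs.CorrectForClockSkew setting:
using Amazon;

// Let the SDK measure the offset between local time and AWS server time and
// re-sign requests accordingly instead of failing with RequestTimeTooSkewed.
AWSConfigs.CorrectForClockSkew = true;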

Set WaitTimeOut for Azure Service Bus Session Processor

In the legacy version of Azure Service Bus (ASB), I can use MessageWaitTimeout in SessionHandlerOptions to control the timeout between two messages. For example, if I set the timeout to 5 seconds, then after completing the first message the queue waits 5 seconds before picking up the next one.
In the new Azure.Messaging.ServiceBus library, the queue has to wait around 1 minute to pick up the next message. I only need to process messages one by one; there is no need to process messages concurrently.
I followed this example and can't find any way to set the timeout as in the old version.
Does anyone know how to do it?
var options = new ServiceBusSessionProcessorOptions
{
    AutoCompleteMessages = false,
    MaxConcurrentSessions = 1,
    MaxConcurrentCallsPerSession = 1,
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(2),
};
EDIT:
I found the solution. It is the RetryOptions property on ServiceBusClientOptions:
var client = new ServiceBusClient("connectionString", new ServiceBusClientOptions
{
    RetryOptions = new ServiceBusRetryOptions
    {
        TryTimeout = TimeSpan.FromSeconds(5)
    }
});
With the latest stable release, 7.2.0, this can be configured with the SessionIdleTimeout property.
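A rough sketch of that option, assuming Azure.Messaging.ServiceBus 7.2.0 or later (it sits alongside the other processor options shown above):
var options = new ServiceBusSessionProcessorOptions
{
    AutoCompleteMessages = false,
    MaxConcurrentSessions = 1,
    MaxConcurrentCallsPerSession = 1,
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(2),
    // Maximum time to wait for the next message in the active session before
    // releasing it, serving the purpose MessageWaitTimeout did in the old SDK.
    SessionIdleTimeout = TimeSpan.FromSeconds(5)
};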

Azure Cosmos db throwing Socket Exceptions

I am using Azure Cosmos DB with a .NET Core 2.1 application, via the Gremlin driver. It works fine, but every few days it starts throwing socket exceptions on the server and we have to recycle the IIS pool. We average about 10,000 hits per day.
We are currently using the default Gateway mode. Should we switch to Direct mode, as it might be a firewall issue?
Here is the implementation:
private DocumentClient GetDocumentClient(CosmosDbConnectionOptions configuration)
{
    _documentClient = new DocumentClient(
        new Uri(configuration.Endpoint),
        configuration.AuthKey,
        new ConnectionPolicy());

    // create database if not exists
    _documentClient.CreateDatabaseIfNotExistsAsync(new Database { Id = configuration.Database });

    return _documentClient;
}
and in startup.cs:
services.AddSingleton(x => GetDocumentClient(cosmosDBConfig));
and here is how we communicate with Cosmos DB:
private DocumentClient _documentClient;
private DocumentCollection _documentCollection;
private CosmosDbConnectionOptions _cosmosDBConfig;

public DocumentCollectionFactory(DocumentClient documentClient, CosmosDbConnectionOptions cosmosDBConfig)
{
    _documentClient = documentClient;
    _cosmosDBConfig = cosmosDBConfig;
}

public async Task<DocumentCollection> GetProfileCollectionAsync()
{
    if (_documentCollection == null)
    {
        _documentCollection = await _documentClient.CreateDocumentCollectionIfNotExistsAsync(
            UriFactory.CreateDatabaseUri(_cosmosDBConfig.Database),
            new DocumentCollection { Id = _cosmosDBConfig.Collection },
            new RequestOptions { OfferThroughput = _cosmosDBConfig.Throughput });
        return _documentCollection;
    }
    return _documentCollection;
}
and then:
public async Task CreateProfile(Profile profile)
{
    var graphCollection = await _graphCollection.GetProfileCollectionAsync();
    var createQuery = GetCreateQuery(profile);

    IDocumentQuery<dynamic> query = _documentClient.CreateGremlinQuery<dynamic>(graphCollection, createQuery);
    if (query.HasMoreResults)
    {
        await query.ExecuteNextAsync();
    }
}
I'm assuming that for communication with Cosmos DB you are using HttpClient. The application should share a single instance of HttpClient.
Every time you dispose an HttpClient and open a new connection, a bunch of connections are left in the TIME_WAIT state. This means the connection was closed on one side (by the OS) but is still "waiting for additional packets".
By default, Windows may hold a connection in this state for 240 seconds, and there is a limit to how quickly the OS can open new sockets. All of this can lead to a System.Net.Sockets.SocketException.
There is a very good article that explains in detail why and how this problem appears, walking through the TCP state diagram.
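To illustrate that advice, a minimal sketch of the shared-instance pattern (generic .NET guidance; the DocumentClient itself should likewise be kept as a single instance, which your services.AddSingleton registration above already does):
using System.Net.Http;

public static class Http
{
    // One HttpClient for the lifetime of the application; it is thread-safe for
    // concurrent requests, so there is no need to create and dispose one per call.
    public static readonly HttpClient Client = new HttpClient();
}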
UPDATED
Possible solution.
You are using the default ConnectionPolicy object. That object has a property called IdleTcpConnectionTimeout which controls the amount of idle time after which unused connections are closed. By default, idle connections are kept open indefinitely. The value must be greater than or equal to 10 minutes.
So the code could look like:
private DocumentClient GetDocumentClient(CosmosDbConnectionOptions configuration)
{
    _documentClient = new DocumentClient(
        new Uri(configuration.Endpoint),
        configuration.AuthKey,
        new ConnectionPolicy()
        {
            IdleTcpConnectionTimeout = new TimeSpan(0, 0, 10, 0)
        });

    // create database if not exists
    _documentClient.CreateDatabaseIfNotExistsAsync(new Database { Id = configuration.Database });

    return _documentClient;
}
Here is a link to ConnectionPolicy Class documentation

Error in Configuring Azure Storage Account after using Retry Policies

This is my code to Configure Azure Storage Account
public CloudTableClient ConfigureStorageAccount()
{
    var storageCred = new StorageCredentials(ConfigurationManager.AppSettings["SASToken"]);
    CloudTableClient = new CloudTableClient(
        new StorageUri(new Uri(ConfigurationManager.AppSettings["StorageAccountUri"])), storageCred);

    var backgroundRequestOption = new TableRequestOptions()
    {
        // Client has a default exponential retry policy with 4 sec delay and 3 retry attempts
        // Retry delays will be approximately 3 sec, 7 sec, and 15 sec
        MaximumExecutionTime = TimeSpan.FromSeconds(30),
        // PrimaryThenSecondary in case of Read-access geo-redundant storage, else set this to PrimaryOnly
        LocationMode = LocationMode.PrimaryThenSecondary,
    };

    CloudTableClient.DefaultRequestOptions = backgroundRequestOption;
    return CloudTableClient;
}
When I specify backgroundRequestOption, I get the error "The Uri for the target storage location is not specified. Please consider changing the request's location mode."
When I don't specify backgroundRequestOption, I don't get any error. Where do I need to specify this URI?
You need to specify both PrimaryUri and SecondaryUri if LocationMode.PrimaryThenSecondary is chosen.
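For example, assuming the secondary endpoint is also kept in configuration ("StorageAccountSecondaryUri" below is a hypothetical key; the secondary host is normally the account name with a -secondary suffix):
var storageCred = new StorageCredentials(ConfigurationManager.AppSettings["SASToken"]);

// Provide both endpoints so PrimaryThenSecondary has a secondary to fall back to.
var storageUri = new StorageUri(
    new Uri(ConfigurationManager.AppSettings["StorageAccountUri"]),           // primary
    new Uri(ConfigurationManager.AppSettings["StorageAccountSecondaryUri"])); // secondary

CloudTableClient = new CloudTableClient(storageUri, storageCred);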

Windows Filtering Platform - How can I block incoming connections based on local port?

I'm attempting to set up some filters using WFP to block inbound connections to a local server (for example, a webserver listening on port 8080).
I've got a filter working which blocks based on the remote port, so I can stop processes on my machine from establishing connections to port 8080, but I can't figure out how to block incoming connections from another machine based on the local port 8080.
Here's my code which works to block based on remote port:
(It's C# using P/invoke but it's pretty much the same as if it were written in C++)
var RemotePort = 8080; // port to block

// connect to the engine
var session = new Fwpm.FWPM_SESSION0 { flags = Fwpm.FWPM_SESSION_FLAG_DYNAMIC };
UInt32 engineHandle;
UnsafeNativeMethods.FwpmEngineOpen0(null, Fwpm.RPC_C_AUTHN_WINNT, IntPtr.Zero, session, out engineHandle);

// create a subLayer to attach filters to
var subLayerGuid = Guid.NewGuid();
var subLayer = new Fwpm.FWPM_SUBLAYER0();
subLayer.subLayerKey = subLayerGuid;
subLayer.displayData.name = DisplayName;
subLayer.displayData.description = DisplayName;
subLayer.flags = 0;
subLayer.weight = 0x100;
UnsafeNativeMethods.FwpmSubLayerAdd0(engineHandle, subLayer, IntPtr.Zero);

// condition: match the remote port
var condition = new Fwpm.FWPM_FILTER_CONDITION0
{
    fieldKey = Fwpm.FWPM_CONDITION_IP_REMOTE_PORT,
    matchType = Fwpm.FWP_MATCH_TYPE.FWP_MATCH_EQUAL,
    conditionValue =
    {
        type = Fwpm.FWP_DATA_TYPE.FWP_UINT16,
        uint16 = (ushort)RemotePort
    }
};

// create the filter itself
var fwpFilter = new Fwpm.FWPM_FILTER0();
fwpFilter.layerKey = Fwpm.FWPM_LAYER_ALE_AUTH_CONNECT_V4;
fwpFilter.action.type = Fwpm.FWP_ACTION_BLOCK;
fwpFilter.subLayerKey = subLayerGuid;
fwpFilter.weight.type = Fwpm.FWP_DATA_TYPE.FWP_EMPTY; // auto-weight
fwpFilter.numFilterConditions = (uint)1;
var condsArray = new[] { condition };
var condsPtr = SafeNativeMethods.MarshalArray(condsArray); // helper to create a native array from a C# one
fwpFilter.filterCondition = condsPtr;
fwpFilter.displayData.name = DisplayName;
fwpFilter.displayData.description = DisplayName;

// add the filter
UInt64 filterId = 0L;
UnsafeNativeMethods.FwpmFilterAdd0(engineHandle, ref fwpFilter, IntPtr.Zero, out filterId);
As mentioned above, this code does work to block connections with remote port of 8080. To block connections with Local Port 8080, I modified the code as follows:
var LocalPort = 8080;

var condition = new Fwpm.FWPM_FILTER_CONDITION0
{
    fieldKey = Fwpm.FWPM_CONDITION_IP_LOCAL_PORT,
    matchType = Fwpm.FWP_MATCH_TYPE.FWP_MATCH_EQUAL,
    conditionValue =
    {
        type = Fwpm.FWP_DATA_TYPE.FWP_UINT16,
        uint16 = (ushort)LocalPort
    }
};

// create the filter itself
var fwpFilter = new Fwpm.FWPM_FILTER0();
fwpFilter.layerKey = Fwpm.FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4;
MSDN implies that FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 is the right place to block inbound connections; however, this doesn't work at all. I've tried FWPM_LAYER_ALE_RESOURCE_ASSIGNMENT_V4 as well as a few other layers, but no matter what I've tried, I am always able to establish connections from another machine to a server on port 8080 on my machine.
Any help would be much appreciated
You should be able to create that filter on any of the INBOUND or RECV layers that support the FWPM_CONDITION_IP_LOCAL_PORT condition; the resource to search for that is:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff549939%28v=vs.85%29.aspx
However, not all traffic passes through every layer. I am by no means an expert, but one approach is to add a filter like that to every applicable layer (a half dozen or so filters) and see if that works. If so, you then remove the filters one at a time until you find the set that was actually needed. On a recent project there were four layers I needed in order to stop all the traffic I was interested in.
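For instance, reusing the asker's Fwpm/UnsafeNativeMethods wrappers (so this sketch assumes those wrappers also define GUIDs for the extra layers listed below), the same local-port condition can be added to several candidate layers and then pruned one at a time:
// Candidate inbound/receive layers that document support for FWPM_CONDITION_IP_LOCAL_PORT.
// Add the same blocking filter to each, test, then remove filters one by one to find
// the minimal set that actually intercepts your traffic.
var candidateLayers = new[]
{
    Fwpm.FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4,
    Fwpm.FWPM_LAYER_ALE_AUTH_LISTEN_V4,
    Fwpm.FWPM_LAYER_ALE_RESOURCE_ASSIGNMENT_V4,
    Fwpm.FWPM_LAYER_INBOUND_TRANSPORT_V4
};

foreach (var layerKey in candidateLayers)
{
    var filter = new Fwpm.FWPM_FILTER0();
    filter.layerKey = layerKey;
    filter.action.type = Fwpm.FWP_ACTION_BLOCK;
    filter.subLayerKey = subLayerGuid;
    filter.weight.type = Fwpm.FWP_DATA_TYPE.FWP_EMPTY; // auto-weight
    filter.numFilterConditions = 1;
    filter.filterCondition = condsPtr; // the local-port condition marshalled above
    filter.displayData.name = DisplayName;
    filter.displayData.description = DisplayName;

    UInt64 filterId;
    UnsafeNativeMethods.FwpmFilterAdd0(engineHandle, ref filter, IntPtr.Zero, out filterId);
}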
One big caveat worth noting is that traffic on localhost may not go through any WFP layers (or perhaps it was only the inbound layers it skipped; I don't remember). So you can use WFP to prevent a remote connection to the port, but a local connection may still go through.
