Azure Service Fabric inter service communication - c#

I currently have one Service Fabric application that is composed of multiple Services. What I'm trying to achieve is a Queuing mechanism so one Service can publish a message to a queue, and another Service can receive messages from the same queue.
The following doesn't work (for the Listener service, there is nothing to dequeue):
PublisherService:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
var myQueue = await StateManager.GetOrAddAsync<IReliableQueue<string>>("fooQueue");
while (true)
{
cancellationToken.ThrowIfCancellationRequested();
using (var tx = this.StateManager.CreateTransaction())
{
// Put some message in the queue
await myQueue.EnqueueAsync(tx, "Foobar");
await tx.CommitAsync();
}
await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
}
}
ListenerService:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
var myQueue = await StateManager.GetOrAddAsync<IReliableQueue<string>>("fooQueue");
while (true)
{
cancellationToken.ThrowIfCancellationRequested();
using (var tx = this.StateManager.CreateTransaction())
{
var result = await myQueue.TryDequeueAsync(tx);
if (result.HasValue)
{
ServiceEventSource.Current.ServiceMessage(this.Context, "New message received: {0}", result.Value.ToString());
}
await tx.CommitAsync();
}
await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
}
}
It looks like the scope of a queue is limited to a single Service. This doesn't appear to be a limitation specified in the documentation.
So my questions are:
is this actually some undocumented limitation?
or is there something wrong in the code above?
how could I achieve the scenario above (one service adds messages to a queue, another service retrieves messages from the same queue)?
Obviously I could use an Azure Service Bus, but I can't for several reasons:
in my actual real-world scenario, I will have several queues (variable number) so it would require creating Service Bus Queues on demand (which is not exactly a fast operation)
adds a dependency to another Azure service (so increases the failure probability for the whole system)
costs more
more complex deployment
etc.

Yes, a ReliableQueue is local to a service, because its purpose is to store state for that particular service. That state is replicated to the service's other replicas. Within the service it behaves much like a normal System.Collections.Generic.Queue<T> in .NET.
For a low-cost solution you could use Azure Storage Queues. Yes, it adds a dependency, but it is highly available. It is a tradeoff that only you can decide whether to accept.
On the other hand, think outside the box:
Create a stateful service with multiple ReliableQueues and expose methods that other services can call using standard remoting communication, like:
class QueuingService
{
void AddToQueue<T>(string queuename, T input) { .. }
void DeQueue(string queuename) { .. }
}
This of course creates a dependency, but it has all the safety mechanisms Service Fabric provides and does not cost you much. But then again, you are building a poor man's Service Bus/Azure Storage Queue yourself.
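To make the shape of such a queuing service concrete, here is a minimal in-memory sketch. It is not Service Fabric code: the `IQueuingService` interface and the `InMemoryQueuingService` class are hypothetical names, and the `ConcurrentQueue` instances merely stand in for the one-`IReliableQueue<string>`-per-name storage that a real stateful service would presumably use inside a transaction.

```csharp
using System.Collections.Concurrent;

// Hypothetical contract; a real version would be exposed via service remoting
// and would likely return Task-based results.
public interface IQueuingService
{
    void AddToQueue(string queueName, string input);
    bool TryDequeue(string queueName, out string message);
}

// In-memory stand-in so the contract can be exercised without a cluster.
// Nothing here is replicated or persisted.
public sealed class InMemoryQueuingService : IQueuingService
{
    private readonly ConcurrentDictionary<string, ConcurrentQueue<string>> _queues =
        new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    // Creates the named queue on first use, then appends the message.
    public void AddToQueue(string queueName, string input)
    {
        _queues.GetOrAdd(queueName, _ => new ConcurrentQueue<string>()).Enqueue(input);
    }

    // Returns false when the queue does not exist or is empty.
    public bool TryDequeue(string queueName, out string message)
    {
        message = null;
        return _queues.TryGetValue(queueName, out var queue) && queue.TryDequeue(out message);
    }
}
```

Because queues are created by name on demand, a variable number of queues (as in the question) needs no provisioning step; the publisher and the listener only have to agree on the queue name.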
About the docs: no, they do not say in so many words that a reliable queue is tied to one service, but it depends on how you interpret this:
Service Fabric offers a stateful programming model available to .NET developers via Reliable Collections. Specifically, Service Fabric provides reliable dictionary and reliable queue classes. When you use these classes, your state (my interpretation: The state of the service) is partitioned (for scalability), replicated (for availability), and transacted within a partition (for ACID semantics).

Check out the Priority Queue Service, which was created for this purpose.

If you add a fault-handling retry pattern to all of your calling code, you should not need a queue in between your calls, see https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-communication
Relevant part from the link is here:
An exception handler is responsible for determining what action to take when an exception occurs. Exceptions are categorized into retryable and non retryable.
Non retryable exceptions simply get rethrown back to the caller.
retryable exceptions are further categorized into transient and non-transient.
Transient exceptions are those that can simply be retried without re-resolving the service endpoint address. These will include transient network problems or service error responses other than those that indicate the service endpoint address does not exist.
Non-transient exceptions are those that require the service endpoint address to be re-resolved. These include exceptions that indicate the service endpoint could not be reached, indicating the service has moved to a different node.
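The three categories quoted above can be sketched as a small retry loop. This is only an illustration using plain .NET types, not the Service Fabric communication API: `FailureKind`, `RetryingCaller`, and the `resolveEndpoint`, `call`, and `classify` delegates are hypothetical names supplied by the caller, and the linear back-off is an arbitrary choice.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical classification mirroring the categories in the quoted text.
public enum FailureKind { NonRetryable, Transient, NonTransient }

public static class RetryingCaller
{
    public static async Task<TResult> CallAsync<TResult>(
        Func<Task<string>> resolveEndpoint,        // looks up the current address
        Func<string, Task<TResult>> call,          // performs the actual request
        Func<Exception, FailureKind> classify,     // maps an exception to a category
        int maxAttempts = 5)
    {
        string endpoint = await resolveEndpoint();
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await call(endpoint);
            }
            // Non-retryable exceptions (and the final attempt) fall through
            // the filter and are rethrown to the caller.
            catch (Exception ex) when (attempt < maxAttempts &&
                                       classify(ex) != FailureKind.NonRetryable)
            {
                // Non-transient: the address is stale, so resolve it again.
                if (classify(ex) == FailureKind.NonTransient)
                {
                    endpoint = await resolveEndpoint();
                }

                // Transient and non-transient failures both wait, then retry.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
            }
        }
    }
}
```

The point of the split is visible in the catch block: only a non-transient failure pays the cost of re-resolving the address, while a transient one retries against the same endpoint.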


NServiceBus, how do you do an action before a message is sent to the error queue?

I am using NServiceBus to create an endpoint.
The endpoint listens for an event, does some calculation, then publishes the result (success or failure) to other endpoints.
I know that NServiceBus supports immediate retries and delayed retries, and that they are configurable.
Now I want to publish a failure result event to other endpoints after all retries are exhausted (before the message is sent to the error queue).
public async Task Handle(MyEvent message, IMessageHandlerContext context)
{
Console.WriteLine($"Received MyEvent, ID = {message.Id}");
//Connect to other services to get data and do some calculation
Thread.Sleep(1000);
Console.WriteLine($"Processed MyEvent, ID = { message.Id}");
await context.Publish(new MyEventResult { IsSucceed = true });
}
Above is my current code. It publishes a successful result if no exception is thrown. But if a fatal exception occurs, I don't know how to publish a failure result event before the message is sent to the error queue.
Thanks in advance.
Notes: I am using NServiceBus 6.4.3
I'm not sure why you want this, but have you looked at NServiceBus sagas? They are intended for cases where you have to do blocking IO via (external) services. You can take alternative actions when a specific task hasn't been performed within an allotted period or when the returned result was incorrect.
https://docs.particular.net/nservicebus/sagas/
See the following sample of a saga:
https://docs.particular.net/samples/saga/simple/
The following sample shows the usage of saga timeouts. If a specific task has not been performed within a specific duration, an alternative action can be performed, such as publishing an event or performing a ReplyToOriginator:
https://docs.particular.net/nservicebus/sagas/timeouts
https://docs.particular.net/nservicebus/sagas/reply-replytooriginator-differences
https://docs.particular.net/nservicebus/sagas/#notifying-callers-of-status
By using sagas you are making your process explicit. I would avoid hooking into the recovery mechanism for this.
The recovery mechanism is meant to deal with transient errors like network connectivity issues, database deadlocks, etc. but not with expected failure results. You should properly process these and continue your modeled process in its unhappy path.

Simulate 10,000 Azure IoT Hub Device connections from Azure Service Fabric cluster

We are developing a .NET Core service that will be hosted in Azure Service Fabric. This SF service needs to interact with 10,000 devices registered in Azure IoT Hub via its AMQP 1.0 SSL/TLS endpoints. Each IoT Hub device has its own security token and connection string provided by the IoT Hub service.
For our scenario we need to listen to all cloud-to-device messages coming from the 10,000 IoT Hub device instances and "route" these to a central Service Bus topic to which the actual "gateways" in the field listen. So basically we want to forward messages from 10,000 Service Bus queues into one central queue.
What is the best approach to handle these 10,000 AMQP listeners from an SF service? Is there a way we can reuse AMQP connections, sessions or links so that we cache/share resources? And how can we dynamically spread the load of connection maintenance over the 5 nodes in the SF cluster?
We are evaluating these Nuget packages for the implementation:
Microsoft.Azure.ServiceBus
AMQPNetLite
Microsoft.Azure.Devices.Client
We are doing some tests using the Microsoft.Azure.Devices.Client lib, see a simplified code sample below:
using System;
using System.Fabric;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;
using Microsoft.ServiceFabric.Services.Runtime;
namespace ID.Monitoring.MonServer.ServiceFabric.ServiceBus
{
/// <summary>
/// An instance of this class is created for each service instance by the Service Fabric runtime.
/// </summary>
internal sealed class ServiceBus : StatelessService
{
private readonly DeviceClient _deviceClient;
private ConnectionStatus _status;
public ServiceBus(StatelessServiceContext context)
: base(context)
{
_deviceClient = DeviceClient.CreateFromConnectionString("HostName=id-monitoring-dev.azure-devices.net;DeviceId=100;SharedAccessSignature=SharedAccessSignature sr=id-monitoring-dev.azure-devices.net%2Fdevices%2F100&sig={token}&se=1553265888", TransportType.Amqp_Tcp_Only);
}
/// <summary>
/// This is the main entry point for your service instance.
/// </summary>
/// <param name="cancellationToken">Canceled when Service Fabric needs to shut down this service instance.</param>
protected override async Task RunAsync(CancellationToken cancellationToken)
{
_deviceClient.SetConnectionStatusChangesHandler(ConnectionStatusChangeHandler);
while (true)
{
if (_status != ConnectionStatus.Connected)
{
await _deviceClient.OpenAsync();
}
var receivedMessage = await _deviceClient.ReceiveAsync(TimeSpan.FromSeconds(10)).ConfigureAwait(false);
if (receivedMessage != null)
{
var messageData = Encoding.ASCII.GetString(receivedMessage.GetBytes());
//TODO: handle incoming message and publish to common
await _deviceClient.CompleteAsync(receivedMessage).ConfigureAwait(false);
}
}
}
private void ConnectionStatusChangeHandler(ConnectionStatus status, ConnectionStatusChangeReason reason)
{
_status = status;
}
}
}
Question: Does this scale well to 10,000 Service Fabric service instances? Or are there more efficient ways to maintain this many AMQP Service Bus listeners from a Service Fabric service environment? Is there a way we can apply AMQP connection multiplexing, maybe?
Take a look at this.
The second answer provides a sample that allows you to multiplex multiple devices onto one Amqp connection.
The approach you chose to monitor your devices won't scale well and will be hard to maintain.
Currently, Service Fabric limits how many instances of a service you can place on a single node. For example, if you create an application with your ServiceBus service and try to spawn 10,000 instances, you will hit this limitation, which is the number of nodes: on a 5-node cluster you will be able to run only 5 instances of your service using the default scaling approach.
To bypass this issue you have some options:
Partitioning:
To have a single stateless service running more partitions than the node count, you have to partition your service. Assuming you have a 5-node cluster and need 10,000 instances, you will need 2,000 partitions running on each node. If you use a shared process and have enough RAM for this, this approach might help you. Please take a look at this thread and this thread before following this approach.
Multiple Named Services:
A named service is a running instance of a service type; in this case you would create one per device, like:
ServiceBusType
ServiceBus-Device1
ServiceBus-Device2
ServiceBus-Device3
This approach will consume too many resources on your machines, as you will be running one instance for each device, but it is easy to manage, as you can spawn new instances for each new device without affecting other running services.
Parallel Processing per instance:
Here each instance would be responsible for processing multiple messages concurrently; in this case you would create 2,000 connections per instance (if running 5 instances, one per node, in the cluster). This is lighter on resource consumption than the other approaches, but a bit harder to maintain, as you will have to handle the balancing yourself and might need an extra service to monitor and delegate tasks to all the services and to ensure the messages are being processed evenly.
Summary:
One instance handling one connection with one message at a time will require 10,000 instances of your service. Partitioning is similar, but you can use a shared process to reduce memory consumption; even so, memory consumption will still be high in both cases.
Multiple named services could be an option if the number of services were not too high. You also wouldn't be able to share the connection, so I wouldn't recommend this approach for your scenario.
The third option is the most resource-friendly, but you will have to find a way to partition the connections evenly across the cluster nodes.
You can also use a mixed approach: for example, a service that handles multiple messages in parallel, combined with a partitioned service that defines the key range of devices.
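As a rough illustration of splitting devices into key ranges, the sketch below assigns each device to a bucket by taking its numeric id modulo the partition count. `DevicePartitioner` is a hypothetical helper, and it assumes device ids are non-negative integers (the question's sample uses `DeviceId=100`); string ids would need a stable hash instead.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DevicePartitioner
{
    // Maps a non-negative numeric device id onto one of partitionCount buckets.
    public static int PartitionFor(int deviceId, int partitionCount)
    {
        return deviceId % partitionCount;
    }

    // The device ids that a given partition should open connections for.
    public static IReadOnlyList<int> DevicesFor(
        int partition, int partitionCount, IEnumerable<int> allDeviceIds)
    {
        return allDeviceIds
            .Where(id => PartitionFor(id, partitionCount) == partition)
            .ToList();
    }
}
```

With ids 1 to 10,000 and 5 partitions, every partition ends up with 2,000 devices, which matches the 2,000-connections-per-instance figure mentioned above. Each partition would then handle its own devices in parallel.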
Please take a look in the links I've mentioned.
I found that there is a DeviceClient constructor that allows the AmqpConnectionPoolSettings to be set.

Accessing Service Fabric service state during cancellation

I have a Service Fabric cluster hosting an 'Orchestrator'-type service which spins up and shuts down other Stateful services to do work, using FabricClient.ServiceManagementClient's CreateServiceAsync and DeleteServiceAsync methods.
The work involves processing messages which are stored for a short time within a ReliableConcurrentQueue.
I'm trying to handle the graceful shutdown of these services via the CancellationToken by ensuring that the queue is completely drained of messages before the service is deleted, but have found that the service's access to the ReliableConcurrentQueue is revoked once the CancellationToken is cancelled.
For example, calling StateManager.GetOrAddAsync<T>() from a callback registered with the CancellationToken, results in a FabricNotReadableException, containing the message "Primary state manager is currently not readable".
Reading around, it seems this is expected behaviour:
"In Service Fabric, when a Primary is demoted, one of the first things
that happens is that write access to the underlying state is revoked."
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle
Also, the answers to this question suggest that FabricNotReadableException is often a transient issue, and affected calls can be retried. This doesn't seem to be the case in this example; multiple retries at various frequencies/delays all seem to fail the same way.
Is there a way to guarantee that everything in the queue is processed using the combination of Stateful services, Reliable Collections and CancellationTokens? Or should I be looking into storage outside of what Service Fabric can provide?
Consider performing the queue item processing inside RunAsync.
Stopping / changing the role of a service causes the CancellationToken passed to RunAsync to be cancelled.
Once that happens, you need to make sure that you only exit that method when the queue depth is 0.
Also, once this cancellation is requested, you should probably stop allowing new items to be enqueued.
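The control flow suggested above can be sketched with plain .NET types. This is only an illustration: `DrainingWorker` is a hypothetical class and a `ConcurrentQueue` stands in for the `ReliableConcurrentQueue`. Whether the reliable collection stays accessible for long enough after cancellation is something to verify in your own cluster, since the question reports that access is revoked.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class DrainingWorker
{
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();
    private volatile bool _accepting = true;

    // Rejects new work once shutdown has started.
    // Note: an enqueue racing with the final emptiness check could still be
    // missed; a real implementation would need to close that gap.
    public bool TryEnqueue(string item)
    {
        if (!_accepting) return false;
        _queue.Enqueue(item);
        return true;
    }

    // Keeps processing until cancellation is requested AND the queue is empty.
    public async Task RunAsync(Func<string, Task> process, CancellationToken cancellationToken)
    {
        while (true)
        {
            if (cancellationToken.IsCancellationRequested)
            {
                _accepting = false; // stop taking new items
            }

            if (_queue.TryDequeue(out var item))
            {
                await process(item);
            }
            else if (cancellationToken.IsCancellationRequested)
            {
                return; // cancelled and drained: safe to exit
            }
            else
            {
                // Idle: wait briefly, waking early if cancellation arrives.
                try { await Task.Delay(100, cancellationToken); }
                catch (OperationCanceledException) { }
            }
        }
    }
}
```

The loop never throws on cancellation; it treats the token as a signal to stop accepting work and finish what is already queued.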

TxSelect and TransactionScope

Recently, I've been checking out RabbitMQ over C# as a way to implement pub/sub. I'm more used to working with NServiceBus. NServiceBus handles transactions by enlisting MSMQ in a TransactionScope. Other transaction aware operations can also enlist in the same TransactionScope (like MSSQL) so everything is truly atomic. Underneath, NSB brings in MSDTC to coordinate.
I see that in the C# client API for RabbitMQ there are IModel.TxSelect() and IModel.TxCommit(). This works well for holding messages back from the exchange until the commit, which covers the use case where multiple messages sent to the exchange need to be atomic. However, is there a good way to synchronize a database call (say to MSSQL) with the RabbitMQ transaction?
You can write a RabbitMQ Resource Manager to be used by MSDTC by implementing the IEnlistmentNotification interface. The implementation provides two phase commit notification callbacks for the transaction manager upon enlisting for participation. Please note that MSDTC comes with a heavy price and will degrade your overall performance drastically.
Example of RabbitMQ resource manager:
sealed class RabbitMqResourceManager : IEnlistmentNotification
{
private readonly IModel _channel;
public RabbitMqResourceManager(IModel channel, Transaction transaction)
{
_channel = channel;
_channel.TxSelect();
transaction.EnlistVolatile(this, EnlistmentOptions.None);
}
public RabbitMqResourceManager(IModel channel)
{
_channel = channel;
_channel.TxSelect();
if (Transaction.Current != null)
Transaction.Current.EnlistVolatile(this, EnlistmentOptions.None);
}
public void Commit(Enlistment enlistment)
{
_channel.TxCommit();
enlistment.Done();
}
public void InDoubt(Enlistment enlistment)
{
Rollback(enlistment);
}
public void Prepare(PreparingEnlistment preparingEnlistment)
{
preparingEnlistment.Prepared();
}
public void Rollback(Enlistment enlistment)
{
_channel.TxRollback();
enlistment.Done();
}
}
Example using resource manager
using(TransactionScope trx= new TransactionScope())
{
var basicProperties = _channel.CreateBasicProperties();
basicProperties.DeliveryMode = 2;
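// Enlist in the ambient transaction; the constructor expects a Transaction, not the TransactionScope
new RabbitMqResourceManager(_channel, Transaction.Current);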
_channel.BasicPublish(someExchange, someQueueName, basicProperties, someData);
trx.Complete();
}
As far as I'm aware there is no way of coordinating the TxSelect/TxCommit with the TransactionScope.
Currently the approach that I'm taking is to use durable queues with persistent messages to ensure they survive RabbitMQ restarts. When consuming from the queues, I read a message off, do some processing, and then insert a record into the database. Once all this is done I ACK(nowledge) the message and it is removed from the queue. The potential problem with this approach is that a message could end up being processed twice (for example, if the record is committed to the DB but the connection to RabbitMQ is lost before the message can be ack'd). For the system that we're building, though, our main concern is throughput. (I believe this is called the "at-least-once" approach.)
The RabbitMQ site does say that there is a significant performance hit using the TxSelect and TxCommit so I would recommend benchmarking both approaches.
Whichever way you do it, you will need to ensure that your consumer can cope with the message potentially being processed twice.
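One common way to cope with duplicate deliveries is to remember which message ids have already been handled. The sketch below is a minimal, single-threaded illustration: `IdempotentConsumer` is a hypothetical class, it assumes every message carries a unique id, and it keeps the ids in memory. In a real system the id would more likely be recorded in the same database transaction as the work itself.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdempotentConsumer
{
    // Ids of messages whose handler has already completed.
    private readonly HashSet<string> _processedIds = new HashSet<string>();
    private readonly Action<string> _handler;

    public IdempotentConsumer(Action<string> handler)
    {
        _handler = handler;
    }

    // Returns true if the handler ran, false if the message was a duplicate.
    // The id is recorded only after the handler succeeds, so a failed
    // attempt can be redelivered and retried.
    public bool Handle(string messageId, string body)
    {
        if (_processedIds.Contains(messageId)) return false;

        _handler(body);
        _processedIds.Add(messageId);
        return true;
    }
}
```

With this in place, a redelivery after a lost ACK becomes a no-op instead of a second database insert.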
If you haven't found it yet take a look at the .Net user guide for RabbitMQ here, specifically section 3.5
Let's say you've got a service bus implementation for your abstraction IServiceBus. We can pretend it's RabbitMQ under the hood, but it certainly doesn't need to be.
When you call servicebus.Publish, you can check System.Transactions.Transaction.Current to see if you're in a transaction. If you are, and it's a transaction for an MSSQL Server connection, then instead of publishing to Rabbit you can publish to a broker queue within SQL Server, which will respect the commit/rollback of whatever database operation you're performing (you want to do some connection magic here to avoid the broker publish upgrading your transaction to MSDTC).
Now you need to create a service that reads the broker queue and does the actual publish to Rabbit. This way, for very important things, you can guarantee that your database operation completed first and that the message gets published to Rabbit at some point in the future (when the service relays it). Failures are still possible here if an exception occurs while committing the broker receive, but the window for problems is drastically reduced, and in the worst case you would end up publishing multiple times; you would never lose a message. This is very unlikely: SQL Server going offline after the receive but before the commit is an example of when you would end up double publishing (when the server comes back online you'd publish again). You can build your service to mitigate some of this, but unless you use MSDTC and all that comes with it (yikes) or build your own MSDTC (yikes yikes), you are going to have potential failures. It's all about making the window small and unlikely to occur.

WCF Windows Service - Long operations/Callback to calling module

I have a Windows Service that takes the names of a bunch of files and performs operations on them (zip/unzip, updating a DB, etc.). The operations can take time depending on the size and number of files sent to the service.
(1) The module that is sending a request to this service waits until the files are processed. I want to know if there is a way to provide a callback in the service that will notify the calling module when it is finished processing the files. Please note that multiple modules can call the service at a time to process files so the service will need to provide some kind of a TaskId I guess.
(2) If a service method is called and is running, and another call is made to the same service, how will that call be processed? (I think there is only one thread associated with the service.) I have seen that when the service is taking time to process a method, the number of threads associated with the service begins to increase.
WCF does indeed offer duplex bindings which allow you to specify a callback contract, so that the service can call back to the calling client to notify.
However, in my opinion, this mechanism is rather flaky and not really to be recommended.
In such a case, when the call causes a fairly long running operation to happen, I would do something like this:
If you want to stick to HTTP/NetTcp bindings, I would:
drop off the request with the service, and then "let go" - this would be a one-way call, you just drop off what you want to have done, and then your client is done
have a status call that the client could call after a given time to find out whether or not the results of the request are ready by now
if they are, there should be a third service call to retrieve the results
So in your case, you could drop off the request to zip some files. The service would go off and do its work and store the resulting ZIP in a temporary location. Then later on the client could check to see whether the ZIP is ready, and if so, retrieve it.
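The three calls described above (drop off, check status, retrieve) can be sketched without any WCF plumbing. `JobTracker<TResult>` is a hypothetical in-memory class; in the real service each method would sit behind its own operation on the service contract, and the returned `Guid` would play the role of the TaskId from the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class JobTracker<TResult>
{
    private readonly ConcurrentDictionary<Guid, Task<TResult>> _jobs =
        new ConcurrentDictionary<Guid, Task<TResult>>();

    // Call 1: drop off the request and get a ticket back immediately.
    public Guid Submit(Func<TResult> work)
    {
        var id = Guid.NewGuid();
        _jobs[id] = Task.Run(work);
        return id;
    }

    // Call 2: poll to see whether the work has finished.
    public bool IsComplete(Guid id)
    {
        return _jobs.TryGetValue(id, out var job) && job.IsCompleted;
    }

    // Call 3: fetch the result once it is ready; the job is then forgotten.
    // If the work threw, the exception is rethrown here.
    public bool TryGetResult(Guid id, out TResult result)
    {
        result = default(TResult);
        if (_jobs.TryGetValue(id, out var job) && job.IsCompleted && _jobs.TryRemove(id, out _))
        {
            result = job.GetAwaiter().GetResult();
            return true;
        }
        return false;
    }
}
```

Because each request gets its own ticket, several calling modules can have work in flight at the same time without blocking on the service.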
This works even better over a message queue (MSMQ) which is present in every Windows server machine (but not a lot of people seem to know about it or use it):
your client drops off the request on a request queue
the service listens on that request queue, fetches request after request, and does its work
the service can then post the results to a result queue, on which your callers in turn are listening
Check out how to do all of this efficiently by reading the excellent MSDN article Foundations: Build a Queued WCF Response Service - highly recommended!
A message-queue-based system tends to be much more stable and less error-prone than a duplex/callback-contract-based system, in my opinion.
(1) The simplest way to achieve this is with a taskId as you note, and then have another method called IsTaskComplete with which client can check whether the task has been completed.
(2) Additional calls made to the service will start new threads.
edit: the default service behaviour is to start new threads per call. The configurable property is InstanceContextMode, which can be set to PerCall, PerSession, or Single.
The question has a solution, but I'm using a WCF duplex service to get the result of a long operation, and even though I found a problem that has cost me several hours to solve (and that's why I searched this question earlier), now it works perfectly, and I believe it is a simple solution within the WCF duplex service framework.
What is the problem with a long operation? The main problem is blocking the client interface while the server performs the operation, and with the WCF duplex service we can use a call back to the client to avoid the blockage (It is an old method to avoid blocking but it can easily be transformed into the async/await framework using a TaskCompletionSource).
In short, the solution uses a method to start the operation asynchronously on the server and returns immediately. When the results are ready, the server returns them by means of the client call back.
First, you have to follow any standard guide to create WCF duplex services and clients, and I found these two useful:
msdn duplex service
Codeproject Article WCF Duplex Service
Then follow these steps adding your own code:
Define the call back interface with an event manager method to send results from the server and receive them in the client.
public interface ILongOperationCallBack
{
[OperationContract(IsOneWay = true)]
void OnResultsSend(....);
}
Define the Service Interface with a method to pass the parameters needed by the long operation (refer the previous ILongOperationCallBack interface in the CallBackContractAttribute)
[ServiceContract(CallbackContract=typeof(ILongOperationCallBack))]
public interface ILongOperationService
{
[OperationContract]
bool StartLongOperation(...);
}
In the Service class that implements the Service Interface, first get the proxy of the client call back and save it in a class field, then start the long operation work asynchronously and return the bool value immediately. When the long operation work is finished send the results to the client using the client call back proxy field.
public class LongOperationService:ILongOperationService
{
ILongOperationCallBack clientCallBackProxy;
public ILongOperationCallBack ClientCallBackProxy
{
get
{
return OperationContext.Current.GetCallbackChannel<ILongOperationCallBack>();
}
}
public bool StartLongOperation(....)
{
if(!server.IsBusy)
{
//set server busy state
//**Important get the client call back proxy here and save it in a class field.**
this.clientCallBackProxy=ClientCallBackProxy;
//start long operation in any asynchronous way
......LongOperationWorkAsync(....)
return true; //return immediately
}
else return false;
}
private void LongOperationWorkAsync(.....)
{
.... do work...
//send results when finished using the cached client call back proxy
this.clientCallBackProxy.OnResultsSend(....);
//clear server busy state
}
....
}
In the client create a class that implements ILongOperationCallBack to receive results and add a method to start the long operation in the server (the start method and the event manager don't need to be in the same class)
public class LongOperationManager: ILongOperationCallBack
{
public void StartLongOperation(ILongOperationService server, ....)
{
//here you can make the method async using a TaskCompletionSource
if(server.StartLongOperation(...)) Console.WriteLine("long oper started");
else Console.WriteLine("Long Operation Server is busy");
}
public void OnResultsSend(.....)
{
... use long operation results..
//Complete the TaskCompletionSource if you used one
}
}
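The comments above mention wrapping the callback in a TaskCompletionSource so that the client can simply await the result. A minimal sketch of that idea follows. It does not use WCF: `FakeLongOperationServer` is a hypothetical stand-in for the duplex service, and a plain `Action<int>` stands in for the callback contract.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for the duplex service: starts the work, returns at once,
// and invokes the callback when the work is finished.
public sealed class FakeLongOperationServer
{
    public bool StartLongOperation(int input, Action<int> onResults)
    {
        Task.Run(() => onResults(input * 2)); // pretend this takes a long time
        return true;                          // false would mean "server busy"
    }
}

public sealed class LongOperationClient
{
    private readonly FakeLongOperationServer _server;

    public LongOperationClient(FakeLongOperationServer server)
    {
        _server = server;
    }

    // Turns "start now, get called back later" into an awaitable Task.
    public Task<int> RunLongOperationAsync(int input)
    {
        var tcs = new TaskCompletionSource<int>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        // The callback completes the task when the server reports results.
        bool started = _server.StartLongOperation(input, result => tcs.TrySetResult(result));
        if (!started)
        {
            tcs.TrySetException(new InvalidOperationException("Long operation server is busy."));
        }

        return tcs.Task;
    }
}
```

A caller can then write `int result = await client.RunLongOperationAsync(21);` and keep its UI responsive while the server works.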
NOTES:
I use the bool return value of the StartLongOperation method to indicate that the server is busy as opposed to down, but it is only necessary when the long operation can't run concurrently, as in my actual application, and maybe there are better ways in WCF to achieve non-concurrency (to discover whether the server is down, add a try/catch block as usual).
The important point that I didn't see documented is the need to cache the client callback proxy in the StartLongOperation method. My problem was that I was trying to get the proxy in the working method (yes, all the examples use the client callback proxy in the service method, but it isn't explicitly stated in the documentation, and in the case of a long operation we must delay the callback until the operation ends).
Do not fetch and cache the callback proxy a second time after a service method has returned and before the next one is called.
Disclaimer: I haven't added code to control errors, etc.
