I have a Service Fabric cluster hosting an 'Orchestrator'-type service which spins up and shuts down other Stateful services to do work, using FabricClient.ServiceManagementClient's CreateServiceAsync and DeleteServiceAsync methods.
The work involves processing messages which are stored for a short time within a ReliableConcurrentQueue.
I'm trying to handle the graceful shutdown of these services via the CancellationToken, by ensuring that the queue is completely drained of messages before the service is deleted. However, I have found that the service's access to the ReliableConcurrentQueue is revoked once the CancellationToken is cancelled.
For example, calling StateManager.GetOrAddAsync<T>() from a callback registered with the CancellationToken results in a FabricNotReadableException with the message "Primary state manager is currently not readable".
Reading around, it seems this is expected behaviour:
"In Service Fabric, when a Primary is demoted, one of the first things
that happens is that write access to the underlying state is revoked."
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle
Also, the answers to this question suggest that FabricNotReadableException is often a transient issue, and affected calls can be retried. This doesn't seem to be the case in this example; multiple retries at various frequencies/delays all seem to fail the same way.
Is there a way to guarantee that everything in the queue is processed using the combination of Stateful services, Reliable Collections and CancellationTokens? Or should I be looking into storage outside of what Service Fabric can provide?
Consider performing the queue item processing inside RunAsync.
Stopping / changing the role of a service causes the CancellationToken passed to RunAsync to be cancelled.
Once that happens, you need to make sure that you only exit that method when the queue depth is 0.
Also, once this cancellation is requested, you should probably stop allowing new items to be enqueued.
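A minimal sketch of that shape, assuming a ReliableConcurrentQueue<string> named "workQueue" and a hypothetical ProcessMessageAsync handler (names and structure are illustrative, not prescriptive):
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var queue = await StateManager
        .GetOrAddAsync<IReliableConcurrentQueue<string>>("workQueue");

    // Normal processing: run until shutdown/demotion is requested.
    while (!cancellationToken.IsCancellationRequested)
    {
        await TryProcessOneAsync(queue);
    }

    // Cancellation requested: stop accepting new work elsewhere,
    // then drain the queue before returning from RunAsync.
    while (queue.Count > 0)
    {
        await TryProcessOneAsync(queue);
    }
}

private async Task TryProcessOneAsync(IReliableConcurrentQueue<string> queue)
{
    using (var tx = StateManager.CreateTransaction())
    {
        // Deliberately pass CancellationToken.None here: during the drain
        // phase the shutdown token is already cancelled and would make
        // these calls throw immediately.
        var item = await queue.TryDequeueAsync(tx, CancellationToken.None);
        if (item.HasValue)
        {
            await ProcessMessageAsync(item.Value); // hypothetical handler
            await tx.CommitAsync();
        }
    }
}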
Does the In-Memory Outbox only work with an underlying messaging transport configured?
The documentation and some of the various posts I have read are leading me to believe that it will ONLY work with a specific underlying transport specified. It would be nice if that wasn't the case.
I say this because I have read discussion around the outbox acknowledging messages "from a broker", where messages are acknowledged and publishing occurs only once all processing has completed successfully.
So, when handling the messaging oneself (e.g. via Amazon SQS) and publishing messages into the state machine (i.e. taking the transport message, creating a new message, and then handing it off to a consumer or saga state machine), how would the outbox know about and work with the underlying transport messages?
To be really clear, will the outbox work when using the following configuration (note the absence of any messaging transport configuration):
services.AddMediator(configurator =>
{
    configurator.AddConsumer<PublishMessageConsumer>();
    configurator.AddSagaStateMachine<YetAnotherStateMachine, YetSomeMoreState>(
        sagaConfigurator =>
        {
            sagaConfigurator.UseInMemoryOutbox();
        }).DynamoDbRepository()
    // Snip
});
If it DOES work: if I wanted a consumer AND the saga state machine to work in concert, such that the saga published to the consumer and the consumer failed for some reason, what would actually happen?
The sole purpose of the in-memory outbox is to defer calls to Send/Publish until after the consumer has completed. In the case of a saga, it means after the saga has been persisted to the saga repository after all state machine behaviors for the event have completed successfully (without throwing an exception).
In the case above, the saga would complete all activities for the triggering event, the instance would be saved to the saga repository, and finally the consumer would be created/called by the Send/Publish call from the saga.
If the consumer throws an exception, it won't affect the already persisted saga instance in any way, as that has already completed.
NOW. If you do NOT use the in-memory outbox in this scenario, since it is using mediator (and not a transport), calling Send/Publish in a state machine activity transfers control immediately to the consumer of the sent/published message. After that consumer completes, control returns to the saga; once the activities have completed, the instance is persisted to the repository, the original message consumed by the saga completes, and control returns to the original Send/Publish call.
Mediator is immediate, and any messages produced by consumers and/or sagas are consumed immediately as well.
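To make that ordering concrete, here is a hedged sketch (MassTransit v8-style, with all type names assumed rather than taken from a real project) of a state machine activity that publishes a message; the comments mark when the consumer actually runs:
using System;
using MassTransit;

public class YetSomeMoreState : SagaStateMachineInstance
{
    public Guid CorrelationId { get; set; }
    public string CurrentState { get; set; }
}

public class WorkRequested { }   // assumed trigger message
public class PublishMessage { }  // assumed message for PublishMessageConsumer

public class YetAnotherStateMachine : MassTransitStateMachine<YetSomeMoreState>
{
    public YetAnotherStateMachine()
    {
        InstanceState(x => x.CurrentState);

        Initially(
            When(WorkRequested)
                .ThenAsync(async context =>
                {
                    // With UseInMemoryOutbox: this message is buffered and only
                    // dispatched after the instance is saved to the repository.
                    // Without it (under mediator): the consumer runs inline,
                    // right here, before the saga is persisted.
                    await context.Publish(new PublishMessage());
                })
                .TransitionTo(Working));
    }

    public State Working { get; private set; }
    public Event<WorkRequested> WorkRequested { get; private set; }
}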
I am using a LabVIEW application to simulate a test running, which posts a JSON string to my ASP.NET application. Within the ASP.NET application I format the data with the proper partition and row keys, then send it to Azure Table Storage.
The problem I am having is that after what seems like a random amount of time (i.e. 5 minutes, 2 hours, 5 hours), the data fails to be saved into Azure. I am trying to catch any exceptions within the ASP.NET application and send the error message back to the LabVIEW app, and the LabVIEW app is also catching any exceptions it may encounter, so I can troubleshoot where the issue is occurring.
The only error that I am able to catch is a Timeout Error 56 in the LabVIEW program. My question is, does anyone have an idea of where I should be looking for the root cause of this? I do not know where to begin.
EDIT:
I am using a table storage writer that I found here to do batch operations with retries.
The constructor for exponential retry policy is below:
public ExponentialRetry(TimeSpan deltaBackoff, int maxAttempts)
When you (or, to be exact, the library you use) instantiate this as RetryPolicy = new ExponentialRetry(TimeSpan.FromMilliseconds(2), 100), you are setting the max attempts to 100, which means you may end up waiting up to around 2^100 milliseconds (there is some more math behind this, but simplifying) for each of your individual batch requests to fail on the client side before the SDK gives up retrying.
The other issue with that code is that it executes batch requests sequentially and synchronously, which has multiple bad effects: first, all subsequent batch requests are blocked by the current batch request; second, your cores are blocked waiting on I/O operations; third, it has no exception handling, so if one of the batch operations throws an exception, the method bails out and does not continue processing the other batch requests.
My recommendation: do not use that library; batch operations are fairly straightforward. The default retry policy, if you do not explicitly define one, is the exponential retry policy anyway, with sensible default parameters (it does 3 retries), so you do not even need to define your own retry object. For best scalability and throughput, run your batch operations asynchronously (and concurrently).
As to why things fail: when you write your own API, catch the StorageException and check the HTTP status code on the exception itself. You could be getting throttled by Azure, as one of the possibilities, but it is hard to say without further debugging or without the HTTP status code for the failed batch operations.
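For illustration, a minimal sketch of concurrent batch writes with the classic Microsoft.WindowsAzure.Storage SDK (names assumed; a single TableBatchOperation may contain at most 100 entities, all from one partition):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

static async Task WriteBatchesAsync(CloudTable table, IEnumerable<DynamicTableEntity> entities)
{
    var tasks = new List<Task>();

    // Group by partition key: a batch must target a single partition.
    foreach (var partition in entities.GroupBy(e => e.PartitionKey))
    {
        var chunk = partition.ToList();
        for (int i = 0; i < chunk.Count; i += 100) // max 100 entities per batch
        {
            var batch = new TableBatchOperation();
            foreach (var entity in chunk.Skip(i).Take(100))
                batch.InsertOrReplace(entity);

            // Run batches concurrently; the client's default exponential
            // retry policy handles transient failures.
            tasks.Add(ExecuteBatchAsync(table, batch));
        }
    }

    await Task.WhenAll(tasks);
}

static async Task ExecuteBatchAsync(CloudTable table, TableBatchOperation batch)
{
    try
    {
        await table.ExecuteBatchAsync(batch);
    }
    catch (StorageException ex)
    {
        // The HTTP status code tells you why the batch failed
        // (e.g. 503 "ServerBusy" means you are being throttled).
        Console.WriteLine(
            $"Batch failed: HTTP {ex.RequestInformation.HttpStatusCode} " +
            $"{ex.RequestInformation.ExtendedErrorInformation?.ErrorCode}");
        throw;
    }
}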
You need to check whether an exception is transient or not. As Peter said in his comment, the Azure Storage client already implements a retry policy. You can also wrap your code with another retry layer (e.g. using Polly), or you can change the default policy associated with the Azure Storage client.
I currently have one Service Fabric application that is composed of multiple Services. What I'm trying to achieve is a Queuing mechanism so one Service can publish a message to a queue, and another Service can receive messages from the same queue.
The following doesn't work (for the Listener service, there is nothing to dequeue):
PublisherService:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var myQueue = await StateManager.GetOrAddAsync<IReliableQueue<string>>("fooQueue");

    while (true)
    {
        cancellationToken.ThrowIfCancellationRequested();

        using (var tx = this.StateManager.CreateTransaction())
        {
            // Put some message in the queue
            await myQueue.EnqueueAsync(tx, "Foobar");
            await tx.CommitAsync();
        }

        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}
ListenerService:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var myQueue = await StateManager.GetOrAddAsync<IReliableQueue<string>>("fooQueue");

    while (true)
    {
        cancellationToken.ThrowIfCancellationRequested();

        using (var tx = this.StateManager.CreateTransaction())
        {
            var result = await myQueue.TryDequeueAsync(tx);
            if (result.HasValue)
            {
                ServiceEventSource.Current.ServiceMessage(this.Context, "New message received: {0}", result.Value.ToString());
            }
            await tx.CommitAsync();
        }

        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }
}
It looks like the scope of a queue is limited to a single Service. This doesn't appear to be a limitation specified in the documentation.
So my questions are:
is this actually some undocumented limitation?
or is there something wrong in the code above?
how could I achieve the scenario above (one service adds messages to a queue, another service retrieves messages from the same queue)?
Obviously I could use an Azure Service Bus, but I can't for several reasons:
in my actual real-world scenario, I will have several queues (variable number) so it would require creating Service Bus Queues on demand (which is not exactly a fast operation)
it adds a dependency on another Azure service (and so increases the failure probability of the whole system)
costs more
more complex deployment
etc.
Reliable queues are local to a service, yes, because their intent is to store state for that particular service. That state is replicated to the service's other instances. It is like a normal System.Collections.Generic.Queue<T> in .NET, but replicated.
For a low-cost solution, maybe you can use Azure Storage Queues. Yes, it adds a dependency, but it is highly available. It is a tradeoff that only you can decide to accept or not.
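For reference, a tiny sketch of that alternative, assuming the WindowsAzure.Storage SDK and a connection string supplied by the caller:
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

static async Task DemoAsync(string connectionString)
{
    var account = CloudStorageAccount.Parse(connectionString);
    var queue = account.CreateCloudQueueClient().GetQueueReference("fooqueue");
    await queue.CreateIfNotExistsAsync();

    // Publisher side:
    await queue.AddMessageAsync(new CloudQueueMessage("Foobar"));

    // Listener side:
    var message = await queue.GetMessageAsync();
    if (message != null)
    {
        // Process the message, then delete it so it is not redelivered
        // once its visibility timeout expires.
        await queue.DeleteMessageAsync(message);
    }
}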
On the other hand, think out of the box:
Create a stateful service with multiple reliable queues and expose methods that other services can call using standard remoting communication, like:
class QueuingService
{
    Task AddToQueueAsync<T>(string queueName, T input) { .. }
    Task<T> DeQueueAsync<T>(string queueName) { .. }
}
This of course creates a dependency, but it has all the safety mechanisms Service Fabric provides and does not cost you much. But then again, you are building a poor man's service bus / Azure storage queue yourself.
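A rough sketch of that idea (all names assumed), using Service Fabric remoting so other services can call the queue-owning service:
using System.Collections.Generic;
using System.Fabric;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Communication.Runtime;
using Microsoft.ServiceFabric.Services.Remoting;
using Microsoft.ServiceFabric.Services.Remoting.Runtime;
using Microsoft.ServiceFabric.Services.Runtime;

public interface IQueuingService : IService
{
    Task AddToQueueAsync(string queueName, string item);
    Task<string> DeQueueAsync(string queueName); // null when the queue is empty
}

internal sealed class QueuingService : StatefulService, IQueuingService
{
    public QueuingService(StatefulServiceContext context) : base(context) { }

    public async Task AddToQueueAsync(string queueName, string item)
    {
        // One reliable queue per name, created on first use.
        var queue = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
        using (var tx = StateManager.CreateTransaction())
        {
            await queue.EnqueueAsync(tx, item);
            await tx.CommitAsync();
        }
    }

    public async Task<string> DeQueueAsync(string queueName)
    {
        var queue = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
        using (var tx = StateManager.CreateTransaction())
        {
            var result = await queue.TryDequeueAsync(tx);
            await tx.CommitAsync();
            return result.HasValue ? result.Value : null;
        }
    }

    protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
    {
        // Expose a remoting listener so other services can reach this one.
        return this.CreateServiceRemotingReplicaListeners();
    }
}
A caller would then resolve the service with something like ServiceProxy.Create<IQueuingService>(new Uri("fabric:/MyApp/QueuingService"), new ServicePartitionKey(0)) and invoke the methods directly.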
About the docs: no, they do not say in so many words that a reliable queue is tied to one service, but it depends on how you interpret this:
Service Fabric offers a stateful programming model available to .NET developers via Reliable Collections. Specifically, Service Fabric provides reliable dictionary and reliable queue classes. When you use these classes, your state (my interpretation: The state of the service) is partitioned (for scalability), replicated (for availability), and transacted within a partition (for ACID semantics).
Check out the Priority Queue Service, which was created for this purpose.
If you add a fault-handling retry pattern to all of your calling code, you should not need a queue in between your calls; see https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-communication
The relevant part from the link is here:
An exception handler is responsible for determining what action to take when an exception occurs. Exceptions are categorized into retryable and non-retryable.
Non-retryable exceptions simply get rethrown back to the caller.
Retryable exceptions are further categorized into transient and non-transient.
Transient exceptions are those that can simply be retried without re-resolving the service endpoint address. These will include transient network problems or service error responses other than those that indicate the service endpoint address does not exist.
Non-transient exceptions are those that require the service endpoint address to be re-resolved. These include exceptions that indicate the service endpoint could not be reached, indicating the service has moved to a different node.
Let's say you have this console application:
static void Main(string[] args)
{
    var httpClient = new HttpClient()
        { BaseAddress = new Uri("http://www.timesofmalta.com") };

    var responseTask = httpClient.GetAsync("/");
}
Since the task is not awaited, the program reaches its end, finds no other foreground threads executing, and exits before any response is received. That's pretty clear because this is a console application.
Now let's say you have a WCF application, where a request similarly causes a task to be spawned, but does not await it. Let's say this task is long-running, and is fire-and-forget rather than anything like an HTTP GET.
In such a case, what happens to that task? Does the thread just die as in the console application, bringing down the task with it? Can this cause code occurring later in the task to not be executed?
It depends on how your WCF service is hosted. Whenever the application exits, its threads are torn down, and any outstanding asynchronous operations are simply dropped.
Note that if WCF is hosted in ASP.NET, then fire-and-forget is dangerous; ASP.NET will recycle your app periodically just to keep things clean, and at that time your fire-and-forget operation can disappear. ASP.NET provides APIs to register work like this (if you absolutely must do it in-process).
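For the ASP.NET-hosted case, a minimal sketch using HostingEnvironment.QueueBackgroundWorkItem (available since .NET 4.5.2); DoLongRunningWorkAsync is a hypothetical stand-in for the actual work:
using System.Threading;
using System.Threading.Tasks;
using System.Web.Hosting;

public static void FireAndForget()
{
    HostingEnvironment.QueueBackgroundWorkItem(async (CancellationToken token) =>
    {
        // ASP.NET now knows about this work and will try to delay an
        // AppDomain shutdown until it completes (within a limited grace
        // period). Honor the token: it is signaled when shutdown can no
        // longer wait.
        await DoLongRunningWorkAsync(token); // hypothetical
    });
}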
If you're running on another host, you'll have to take care of registering with that host, using whatever technique is available.
Or, you can introduce a proper distributed architecture: the WCF endpoint merely serializes a description of the work to be done into a reliable queue (Azure queue / MSMQ / WebSphereMQ / etc), and an independent background worker processes that work (Azure webjob / Azure worker role / Win32 service / etc). This is a more complex setup but fixes the "lost work" problem you can get if you try to have your WCF app do it all in-process.
I want to create a simple client-server example in WCF. I did some testing with callbacks, and it works fine so far. I played around a little bit with the following interface:
[ServiceContract(SessionMode = SessionMode.Required, CallbackContract = typeof(IStringCallback))]
public interface ISubscribeableService
{
    [OperationContract]
    void ExecuteStringCallBack(string value);

    [OperationContract]
    ServerInformation Subscribe(ClientInformation c);

    [OperationContract]
    ServerInformation Unsubscribe(ClientInformation c);
}
It's a simple example, adjusted a little bit. You can ask the server to "execute a string callback", in which case the server reverses the string and calls all subscribed client callbacks.
Now, here comes the question: if I want to implement a system where all clients "register" with the server, and the server can "ask" the clients if they are still alive, would you implement this with callbacks (so instead of this string callback, a kind of TellTheClientThatIAmStillHereCallback)? By checking the communication state on the callback I can also "know" if a client is dead. Something similar to this:
// Iterate over a copy so that removing dead callbacks
// does not invalidate the enumeration.
foreach (var callback in Subscribers.ToList())
{
    if (((ICommunicationObject)callback).State == CommunicationState.Opened)
    {
        callback.StringCallbackFunction(new string(retVal));
    }
    else
    {
        Subscribers.Remove(callback);
    }
}
My problem, put in another way:
The server might have 3 clients
Client A dies (I pull the plug of the laptop)
The server dies and comes back online
A new client comes up
So basically, would you use callbacks to verify the "still living" state of clients, or would you use polling and keep track of "how long I haven't heard from a client"?
You can detect most changes to the connection state via the Closed, Closing, and Faulted events of ICommunicationObject. You can hook them at the same time that you set up the callback. This is definitely better than polling.
IIRC, the Faulted event will only fire after you actually try to use the callback (unsuccessfully). So if the Client just disappears - for example, a hard reboot or power-off - then you won't be notified right away. But do you need to be? And if so, why?
A WCF callback might fail at any time, and you always need to keep this in the back of your mind. Even if both the client and server are fine, you might still end up with a faulted channel due to an exception or a network outage. Or maybe the client went offline sometime between your last poll and your current operation. The point is, as long as you code your callback operations defensively (which is good practice anyway), then hooking the events above is usually enough for most designs. If an error occurs for any reason - including a client failing to respond - the Faulted event will kick in and run your cleanup code.
This is what I would refer to as the passive/lazy approach and requires less coding and network chatter than polling or keep-alive approaches.
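For example, a sketch of hooking those events at subscribe time, assuming a Subscribers list of IStringCallback as in the question:
// Inside the Subscribe operation on the service:
var callback = OperationContext.Current.GetCallbackChannel<IStringCallback>();
var channel = (ICommunicationObject)callback;

EventHandler cleanup = (sender, args) =>
{
    // Runs when the channel closes or faults, so dead clients
    // remove themselves from the subscriber list.
    lock (Subscribers)
    {
        Subscribers.Remove(callback);
    }
};

channel.Closed += cleanup;
channel.Faulted += cleanup;

lock (Subscribers)
{
    Subscribers.Add(callback);
}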
If you enable reliable sessions, WCF internally maintains a keep-alive control mechanism. It regularly checks, via hidden infrastructure test messages, if the other end is still there. The time interval of these checks can be influenced via the ReliableSession.InactivityTimeout property. If you set the property to, say, 20 seconds, then the ICommunicationObject.Faulted event will be raised about 20 to 30 (maximum) seconds after a service breakdown has occurred on the other side.
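For instance, a binding configured in code along these lines (values are illustrative):
// A session-capable binding with reliable sessions enabled. With a 20 second
// inactivity timeout, a breakdown on the other side surfaces as a Faulted
// event roughly 20 to 30 seconds later.
var binding = new NetTcpBinding(SecurityMode.None);
binding.ReliableSession.Enabled = true;
binding.ReliableSession.InactivityTimeout = TimeSpan.FromSeconds(20);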
If you want to be sure that client applications always remain "auto-connected", even after temporary service breakdowns, you may want to use a worker thread (from the thread pool) that repeatedly tries to create a new proxy instance on the client side, and calls a session-initiating operation, after the Faulted event has been raised there.
As a second approach, since you are implementing a worker thread mechanism anyway, you might also ignore the Faulted event and let the worker thread loop during the whole lifetime of the client application. You let the thread repeatedly check the proxy state, and try to do its repair work whenever the state is faulted.
Using the first or the second approach, you can implement a service bus architecture (mediator pattern), guaranteeing that all client application instances are constantly ready to receive "spontaneous" service messages whenever the service is running.
Of course, this only works if the reliable session "as such" is configured correctly to begin with (using a session-capable binding, and applying the ServiceContractAttribute.SessionMode, ServiceBehaviorAttribute.InstanceContextMode, OperationContractAttribute.IsInitiating, and OperationContractAttribute.IsTerminating properties in meaningful ways).
I had a similar situation using WCF and callbacks. I did not want to use polling, but I was using a "reliable" protocol, so if a client died, it would hang the server until it timed out and crashed.
I do not know if this is the most correct or elegant solution, but what I did was create a class in the service to represent the client proxy. Each instance of this class contained a reference to the client proxy, and would execute the callback function whenever the server set the "message" property of the class. By doing this, when a client disconnected, the individual wrapper class would get the timeout exception and remove itself from the server's list of listeners, but the service would not have to wait for it. This doesn't actually answer your question about determining whether the client is alive, but it is another way of structuring the service to address the issue. If you needed to know when a client died, you would be able to pick up on when the client wrapper removed itself from the listener list.
I have not tried to use WCF callbacks over the wire, but I have used them for interprocess communication. I was having a problem where all of the calls being sent were ending up on the same thread, making the service deadlock when there were calls that depended on the same thread.
This may apply to the problem you currently have, so here is what I had to do to fix it.
Put this attribute onto the server and client of the WCF server implementation class:
[ServiceBehavior(ConcurrencyMode = ConcurrencyMode.Multiple)]
public class WCFServerClass
ConcurrencyMode.Multiple makes each call process on its own thread, which should help with the server locking up when a client dies, until the call times out.
I also made sure to use a thread pool on the client side, to ensure there were no threading issues there.