We have recently decided to start using on-premises Service Fabric and have encountered a 'dependency' problem.
We have several guest executables with dependencies between them; a service can't recover from a restart of the service it depends on without being restarted itself.
An example to make it clear:
In the chart below, service B is dependent on service A.
If service A encounters an unexpected error and gets restarted, service B goes into an 'error' state that is never reported to the fabric. This means service B reports an OK health state even though it is actually broken.
We were thinking of a solution around these lines:
Raise an independent service which monitors the health state events of all replicas/partitions/applications in the cluster and contains the entire dependency tree.
When the health state of a service changes, it restarts that service's direct dependents, which causes a domino effect of events -> restarts until the entire subtree has been reset (as shown in the Event -> Action flow chart below).
The problem is that healthReport events aren't sent promptly (my entire system could be down and I wouldn't know it for a few minutes). I could poll the health state instead, but I also need history: even if the state is healthy now, that doesn't mean it wasn't in an error state earlier.
Another problem is that the events can be raised at any service level (replica/partition), which would require me to aggregate all of them.
I would really appreciate any help on the matter. I am also completely open to any other suggestion for this problem, even one in a completely different direction.
Cascading failures in services can generally be avoided by introducing fault tolerance at the communication boundaries between services. A few strategies to achieve this:
Introduce retries for failed operations with a delay in between; the delay between attempts may grow exponentially. This is an easy option to implement if you are currently doing a lot of remote procedure call (RPC) style communication between services, and it can be very effective if your dependent services don't take too long to restart. Polly is a well-known library for implementing retries.
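For illustration, a minimal sketch of exponential back-off with Polly; the CallServiceA method and the retry parameters are placeholders, not a prescription:

using System;
using System.Threading.Tasks;
using Polly;

class ServiceAClient
{
    // Placeholder for whatever RPC call the dependent service makes.
    Task CallServiceA() => Task.CompletedTask;

    public async Task CallWithRetriesAsync()
    {
        // Retry up to 5 times, doubling the delay each attempt (2s, 4s, 8s, ...).
        var retryPolicy = Policy
            .Handle<Exception>() // narrow to your transport's exception type in practice
            .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

        await retryPolicy.ExecuteAsync(() => CallServiceA());
    }
}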
Use circuit breakers to close down communications with failing services. In this metaphor, a closed circuit is formed between two services communicating normally. The circuit breaker monitors the communications. If it detects some number of failed communications, it 'opens' the circuit, causing any further communications to fail immediately. The circuit breaker then sends periodic queries to the failing service to check its health, and closes the circuit once the failing service becomes operational. This is a little more involved than retry policies, since you are responsible for preventing an open circuit from crashing your service, and also for deciding what constitutes a healthy service. Polly also supports circuit breakers.
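A hedged sketch of the same call guarded by Polly's circuit breaker; the thresholds and the caught exception type are assumptions you would tune for your transport:

using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

class GuardedServiceAClient
{
    // Placeholder for the real RPC call.
    Task CallServiceA() => Task.CompletedTask;

    // Open the circuit after 3 consecutive failures; probe again after 30 seconds.
    readonly IAsyncPolicy _breaker = Policy
        .Handle<Exception>()
        .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 3,
                             durationOfBreak: TimeSpan.FromSeconds(30));

    public async Task CallAsync()
    {
        try
        {
            await _breaker.ExecuteAsync(() => CallServiceA());
        }
        catch (BrokenCircuitException)
        {
            // The circuit is open: fail fast without touching the network.
            // This is where you decide how to degrade instead of crashing.
        }
    }
}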
Use queues to form fully asynchronous communication between services. Instead of communicating directly from service B to A, queue outbound operations to A in service B. Process the queue in its own thread - do not allow communication failures to escape the queue processor. You may also add an inbound queue to service A to receive messages from service B's outbound queue to completely isolate message processing from the network. This is probably the most durable but also the most complex as it requires a very different architecture from RPC, and you must also decide how to deal with messages which fail repeatedly. You might retry failed messages immediately, send them to the back of the queue after a delay, send them to a dead letter collection for manual processing, or drop the message altogether. Since you're using guest executables you don't have the luxury of reliable collections to help with this process, so a third party solution like RabbitMQ might be useful if you decide to go this way.
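As a rough illustration of the queue-processor isolation, here is an in-memory sketch; the BlockingCollection stands in for a durable broker such as RabbitMQ, and the failure policy shown (re-queue at the back) is just one of the options listed above:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class OutboundQueue
{
    readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

    // Service B enqueues instead of calling service A directly.
    public void Enqueue(string message) => _queue.Add(message);

    // The processor runs on its own thread; communication failures never escape it.
    public Task StartProcessing(Func<string, Task> sendToServiceA, CancellationToken ct) =>
        Task.Run(async () =>
        {
            foreach (var message in _queue.GetConsumingEnumerable(ct))
            {
                try
                {
                    await sendToServiceA(message);
                }
                catch (Exception)
                {
                    // Policy decision: retry immediately, delay, dead-letter, or drop.
                    // Here the failed message goes to the back of the queue.
                    _queue.Add(message);
                }
            }
        }, ct);
}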
Related
I'm currently investigating this but thought I'd ask anyway. I'll post an answer once I find out, if no one has answered by then.
The problem is as follows:
An application calls RabbitHutch.CreateBus to create an instance of IBus/IAdvancedBus to publish messages to RabbitMQ. The instance is returned but the IsConnected flag is set to false (i.e. connection retry is done in the background). When the application serves a specific request, IAdvancedBus.PublishAsync is called to publish a message while the bus still isn't connected. Under significant load, requests to the application end up timing out as the bus was never able to connect to RabbitMQ.
Same behaviour is observed when connectivity to RabbitMQ is lost while processing requests.
The question is:
How is EasyNetQ handling attempts to publish messages while the bus is disconnected?
Are messages queued in memory until the connection can be established? If so, is it disposing of messages after it reaches some limit? Is this configurable?
Or is it forcing the bus to try to connect to RabbitMQ?
Or is it dumping the message altogether?
Does having PublisherConfirms switched on impact this behaviour?
I haven't been able to test all scenarios described above, but it looks like before trying to publish to RabbitMQ, EasyNetQ is checking that the bus is connected. If it isn't, it is entering a connection loop more or less as described here: https://github.com/EasyNetQ/EasyNetQ/wiki/Error-Conditions#there-is-a-network-failure-between-my-subscriber-and-the-rabbitmq-broker
As we increase the load, it looks as if connection loops are spiralling out of control, as none of them ever manages to connect to RabbitMQ because our infrastructure or configuration is broken. I have not yet identified why we are getting timeouts, but I suspect a concurrency issue when several connection loops attempt to connect simultaneously.
I also doubt that switching off PublisherConfirms would help, since we are not able to publish messages in the first place and therefore never get as far as waiting for acknowledgements from RabbitMQ.
Our solution:
So why have I not got a clear answer to this question? The truth is, at this point in time the messages that we are trying to publish are not mission critical, strictly speaking. If our configuration is wrong, deployment will fail when running a health check and we'll essentially abort the deployment. If RabbitMQ becomes unavailable for some reason, we are OK with not having these messages published.
Also, to avoid timing out, we're wrapping message publishing in a circuit breaker that stops publishing when we detect that the circuit between our application and RabbitMQ is open. Roughly speaking, it works as follows:
// The bus is created once at application start-up, not per call.
var bus = RabbitHutch.CreateBus(...).Advanced;
var rabbitMqCircuitBreaker = new CircuitBreaker(...);

rabbitMqCircuitBreaker.AttemptCall(() =>
{
    // Treat a disconnected bus as a failure so the breaker counts it.
    if (!bus.IsConnected)
        throw new Exception(...);

    bus.Publish(...);
});
Notice that we notify the circuit breaker of a problem by throwing an exception when the IsConnected flag is false. If the exception is thrown X times over a configured period, the circuit opens and we stop trying to publish messages for a configured amount of time. We think this is acceptable, as connecting should be quick and the connection should be available 99.xxx% of the time if RabbitMQ is up. Also worth noting: the bus is created when our application starts up, not before each call, so the likelihood of checking the flag before it has actually been set in a valid scenario is pretty low.
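The CircuitBreaker type above is our own code. For readers who want the gist, a hypothetical minimal version (not thread-safe; illustration only) might look like this:

using System;

class CircuitBreaker
{
    readonly int _threshold;         // failures allowed before opening
    readonly TimeSpan _openDuration; // how long to stay open
    int _failures;
    DateTime _openedAt = DateTime.MinValue;

    public CircuitBreaker(int threshold, TimeSpan openDuration)
    {
        _threshold = threshold;
        _openDuration = openDuration;
    }

    public void AttemptCall(Action call)
    {
        // While the circuit is open, skip the call entirely (fail fast).
        if (_failures >= _threshold && DateTime.UtcNow - _openedAt < _openDuration)
            return;

        try
        {
            call();
            _failures = 0; // a success closes the circuit again
        }
        catch
        {
            if (++_failures >= _threshold)
                _openedAt = DateTime.UtcNow; // trip the breaker
        }
    }
}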
Works for us at the moment, any additional information would be appreciated.
I have a Windows service written in C# that reads from MSMQ and, based on the type of each message, assigns it to an Agent that processes it in a worker thread. The application starts with no agents; they are created dynamically at runtime as messages arrive in MSMQ.
Here is a basic figure of how it works:
If the agent worker thread is busy doing work, the message is queued to its local queue. So far so good. But if for some reason the service is stopped, the local queue content is lost.
I am trying to figure out the best way to handle this scenario. Right now the local queues are System.Collections.Concurrent.ConcurrentQueue<T> instances. I could probably use a SQL CE database or some other persistent storage, but I am worried about performance. The other option in my mind is to read from MSMQ only when agents are ready to process a message, but the problem is that I don't know in advance what messages MSMQ will contain.
What possible approaches can I take on this issue?
Your design basically implements the following pattern: http://www.eaipatterns.com/MessageDispatcher.html
However, rather than using actual messaging you are choosing to implement the dispatcher in multithreaded code.
Rather, each processing agent should be an autonomous process with its own physical message queue. This is what provides message durability in case of failure. It also allows you to scale simply by hosting more instances of the processing agent.
I have built a similar system dependent on Redis. The idea is that it provides memory-speed data access isolated from the rest of the application, and it will not shut down when my service does. Furthermore, it eventually persists my data to disk, so I get a good compromise between reliability and speed.
If you design it so that each agent reads from its own message queue hosted in Redis, the queues survive the service's downtime, and each worker's load is already apportioned when you next start the service.
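As a sketch of that idea with StackExchange.Redis, where the queue naming and the use of strings as messages are illustrative assumptions:

using StackExchange.Redis;

class RedisAgentQueue
{
    readonly IDatabase _db;
    readonly string _key;

    public RedisAgentQueue(ConnectionMultiplexer redis, string agentName)
    {
        _db = redis.GetDatabase();
        _key = "queue:" + agentName;
    }

    // Producer side: push to the tail of the agent's list.
    public void Enqueue(string message) => _db.ListRightPush(_key, message);

    // Consumer side: pop from the head. The list lives in Redis,
    // not in process memory, so it survives a service restart.
    public string Dequeue()
    {
        RedisValue value = _db.ListLeftPop(_key);
        return value.IsNull ? null : (string)value;
    }
}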
Why don't you simply create two new MSMQ queues to receive the messages for AgentA and AgentB, and create a new dispatcher that (transactionally) fetches each command from the main queue and dispatches it to the proper agent queue?
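Something along these lines with System.Messaging, where the queue paths and the label-based routing are placeholders (all queues must be transactional for this to work):

using System.Messaging;

class Dispatcher
{
    readonly MessageQueue _main   = new MessageQueue(@".\private$\main");
    readonly MessageQueue _agentA = new MessageQueue(@".\private$\agentA");
    readonly MessageQueue _agentB = new MessageQueue(@".\private$\agentB");

    public void DispatchOne()
    {
        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            Message message = _main.Receive(tx);

            // Route on the label here; real code would inspect the body/type.
            MessageQueue target = message.Label == "A" ? _agentA : _agentB;
            target.Send(message, tx);

            // The receive and the send commit or roll back together, so a
            // crash mid-dispatch never loses or duplicates a message.
            tx.Commit();
        }
    }
}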
I have just finished creating an API where the requests from the API are forwarded to a back-end service via MassTransit/RabbitMQ using the Request/Response pattern. We are now looking at pushing this into production, and are wanting to have multiple instances of the application (both API and service) running on different services, with a load balancer distributing the requests between them.
This leaves us in a position where we could potentially lose all of the messages if one of the servers is taken out of the pool for any reason. I am looking at creating a RabbitMQ cluster between the servers (each server has a local install) and was wondering how I would go about setting up the competing consumers in this instance.
Does RabbitMQ or MassTransit handle this so that only one consumer will receive the request, or will all consumers receive it and attempt to respond? Also, with the RabbitMQ cluster, how will MassTransit/RabbitMQ handle a node failing?
You should take a look at this document, which explains the common distributed scenarios quite nicely:
http://www.rabbitmq.com/distributed.html
For your scenario I think federation would be a better fit than clustering. If you do go for clustering, you should look at mirrored queues.
If all you need is performance, you are better off having a single server handle your message queuing, with the other servers connecting to it to produce/consume messages.
I don't know how MassTransit works, but with Request/Response you should get a single delivery of each message to a single consumer; if the message is not acked (e.g. the consumer crashes), another consumer should pick it up.
Looking for some ideas/pattern to solve a design problem for a system I will be starting work on soon. There is no question that I will need to use some sort of messaging (probably MSMQ) to communicate between certain areas of the system. I don't want to reinvent the wheel, but at the same time I want to make sure I am using the right tool for the job. I have been tinkering with and reading up on NServiceBus, and I'm very impressed with what it does--however I'm not sure it's intended for what I'm trying to achieve.
Here is a (hopefully) very simple and conceptual description of what the system needs to do:
I have a service that clients can send messages to. The service is "Fire and Forget"--the most the client would get back is something that may say success or failure (success being that the message was received).
The handling/processing of each message is non-trivial, and may take up significant system resources. For this reason only X messages can be handled concurrently, where X is a configurable value (based on system specs, etc.). Incoming messages will be stored in queue until it's "their turn" to be handled.
For each client, messages must be handled in order (FIFO). However, some clients may send many messages in succession (thousands or more), for example if they lost connectivity for a period of time. For this reason, messages must be handled in a round-robin fashion across clients--no client is allowed to gorge and no client is allowed to starve. So the system will either have to be able to query the queue for a specific client, or create separate queues per client (automatically, since the clients won't be known at compile time) and pull from them in rotation.
My current thinking is that I really just need to use vanilla MSMQ, create a service to accept messages and write them to one or more queues, then create a process to read messages from the queue(s) and handle/process them. However, the reliability, auditing, scalability, and ease of configuration you get with something like NServiceBus looks very appealing.
Is an ESB the wrong tool for the job? Is there some other technology or pattern I should be looking at?
Update
A few clarifications.
Regarding processing messages "in order"--in the context of a single client, the messages absolutely need to be processed in the order they are received. It's complicated to explain the exact reasons why, but this is a firm requirement. I neglected to mention that only one message per client would ever be processed concurrently. So even if there were 10 worker threads and only one client had messages waiting to be processed, only one of those messages would be processed at a time--there would be no worry of a race condition.
I believe this is generally possible with vanilla MSMQ--that you can have a list of messages in a queue and always take the oldest one first.
I also wanted to clarify a use case for the round robin ordering. In this example, I have two clients (A and B) who send messages, and only one worker thread. All queues are empty. Client A has lost connectivity overnight, so at 8am sends 1000 messages to the service. These messages get queued up and the worker thread takes the oldest one and starts processing it. As this first message is being processed, client B sends a message into the service, which gets queued up (as noted, probably in a separate queue). When Client A's first message completes processing, the logic should check whether client B has a message (it's client B's "turn"), and since it finds one, process it next.
If client B hadn't sent a message during that time, the worker would continue processing client A's messages one at a time, always checking after processing to see if other client queues contained waiting messages to ensure that no client was being starved.
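To make the rotation concrete, here is a hedged in-memory sketch of that round-robin pull; the per-client Queue<string> instances stand in for whatever per-client queues the final design uses:

using System.Collections.Generic;

class RoundRobinScheduler
{
    readonly object _gate = new object();
    readonly Dictionary<string, Queue<string>> _clientQueues =
        new Dictionary<string, Queue<string>>();
    readonly Queue<string> _rotation = new Queue<string>();

    public void Enqueue(string clientId, string message)
    {
        lock (_gate)
        {
            if (!_clientQueues.TryGetValue(clientId, out var queue))
            {
                queue = new Queue<string>();
                _clientQueues[clientId] = queue;
                _rotation.Enqueue(clientId); // a new client joins the rotation
            }
            queue.Enqueue(message);
        }
    }

    // The next message in round-robin order, or null if every queue is empty.
    public string TakeNext()
    {
        lock (_gate)
        {
            for (int i = 0; i < _rotation.Count; i++)
            {
                string clientId = _rotation.Dequeue();
                _rotation.Enqueue(clientId); // this client moves to the back
                if (_clientQueues[clientId].Count > 0)
                    return _clientQueues[clientId].Dequeue();
            }
            return null; // no client had anything waiting
        }
    }
}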
Where I still feel there may be a mismatch between an ESB and this problem is that an ESB is designed to facilitate communication between services; what I am trying to achieve is a combination of messaging/communication and a selective queuing system.
So the system will either have to be able to query the queue for a specific client,
Searching through an MSMQ queue for a message from a particular client using cursors can be inefficient and doesn't scale.
or create separate queues per client (automatically, since the clients won't be known at compile time) and pull from them in rotation.
MSMQ cannot create queues automatically. All messages have to be sent to a known queue first. Your own custom dispatcher service, though, could then create new queues on demand and put copies of the messages in them.
(I avoid saying "move" messages, as you can't do that with application code; you can only read a message and create a new message using the original data. This distinction is important when you are using Source Journaling, for example.)
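For example, an on-demand creation helper in the dispatcher might look like this (the path naming is an assumption):

using System.Messaging;

static class ClientQueues
{
    // One private, transactional queue per client; created on first use.
    public static MessageQueue GetOrCreate(string clientId)
    {
        string path = @".\private$\client-" + clientId;
        if (!MessageQueue.Exists(path))
            MessageQueue.Create(path, transactional: true);
        return new MessageQueue(path);
    }
}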
Cheers
John Breakwell
Using an ESB like NServiceBus seems like a good solution to your problem, but based on your conceptual description there are some things to consider. Let's go through your requirements step by step, using NServiceBus as a possible ESB solution:
I have a service that clients can send messages to. The service is "Fire and Forget"--the most the client would get back is something that may say success or failure (success being that the message was received).
This is easily done with NServiceBus. You can Bus.Send(Message) from the client. If your client requires an answer, you can use Bus.Return(ErrorCode). You mention that "success being that the message was received": if you use an ESB like NServiceBus, it's up to the messaging platform to deliver the message, so if your Bus.Send doesn't throw an exception, you can be sure that the message has been sent properly. Because of this you probably don't have to send success/failure messages back to the client.
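A rough sketch using the classic IBus API; the message type and its fields are hypothetical:

using NServiceBus;

// Hypothetical message type for illustration.
public class SubmitWork : IMessage
{
    public string ClientId { get; set; }
    public string Payload { get; set; }
}

public class Client
{
    readonly IBus _bus; // provided by the NServiceBus host

    public Client(IBus bus) { _bus = bus; }

    public void Submit(string clientId, string payload)
    {
        // Fire-and-forget: if Send doesn't throw, the message is in the
        // local outgoing queue and the transport owns delivery from there.
        _bus.Send(new SubmitWork { ClientId = clientId, Payload = payload });
    }
}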
The handling/processing of each message is non-trivial, and may take up significant system resources. For this reason only X messages can be handled concurrently, where X is a configurable value (based on system specs, etc.). Incoming messages will be stored in queue until it's "their turn" to be handled.
When using NServiceBus, you can configure the number of worker threads by setting the "NumberOfWorkerThreads" option. If your server has multiple cores/CPUs, you can use this setting to balance the workload.
For each client, messages must be handled in order (FIFO).
This is something that may cause problems, depending on your requirements. ESBs in general don't promise to process messages in order when they have many threads working on them. In the case of NServiceBus, you can send an array of messages from the client into the bus and these will be processed in order. You can also solve some in-order messaging problems by using Sagas.
However, some clients may send many messages in succession (thousands or more), for example if they lost connectivity for a period of time
When using an ESB solution, your server doesn't have to be up for the client to work. Clients can still send messages and the server will start processing them as soon as it's back online. Here's a small introduction on this.
For this reason, messages must be handled in a round-robin fashion across clients--no client is allowed to gorge and no client is allowed to starve.
This isn't a problem because you've decided to use messages :)
So the system will either have to be able to query the queue for a specific client, or create separate queues per client (automatically, since the clients won't be known at compile time) and pull from them in rotation.
Could you expand on this? I'm not sure of your design on this one.
I'm programming a monitoring application that needs to display the state of several Windows services. In the current version, I can know whether a service is Running, Stopped, Suspended, or in one of the pending states. That's good, but I'm wondering if there is a way to test whether a service is actually responding. I guess it can be in a running state but not responding at all!
I am using the ServiceController class from System.ServiceProcess. Do you think that if a service is not responding, ServiceController.Status would throw an exception?
How would you approach the problem?
Thanks
EDIT
It seems that ServiceController.Status can throw two types of exceptions:
System.ComponentModel.Win32Exception: An error occurred when accessing a system API.
System.InvalidOperationException: The service does not exist as an installed service.
Nothing about responsiveness.
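For reference, a minimal sketch of reading the status and handling those two exceptions (the service name is a placeholder); note this only reflects what the service control manager knows, not whether the service is responding:

using System.ComponentModel;
using System.ServiceProcess;

static class ServiceProbe
{
    public static ServiceControllerStatus? TryGetStatus(string serviceName)
    {
        try
        {
            using (var controller = new ServiceController(serviceName))
            {
                controller.Refresh(); // re-read the current state from the SCM
                return controller.Status;
            }
        }
        catch (Win32Exception)            { return null; } // error accessing a system API
        catch (InvalidOperationException) { return null; } // not an installed service
    }
}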
This might be obvious, but have you tried talking to the service?
There's no common way to talk to a service, so there is no way Windows can interrogate whether the service is still responding as normal. It is perfectly normal for a service to go into a complete sleep waiting for external I/O to happen, and thus Windows would not get a response while the service is actually alive and functioning exactly as designed.
The only way is to actually send a request to it and wait for the response, and for that you need some inter-process communication channel (a named-pipe sketch follows the list), such as:
Network
Named pipes
Messages
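For example, a minimal liveness probe over named pipes might look like the following; the pipe name and the PING/PONG protocol are made up, and the monitored service would need to host the matching NamedPipeServerStream end:

using System;
using System.IO;
using System.IO.Pipes;

static class LivenessProbe
{
    public static bool IsResponding(string pipeName, int timeoutMs = 2000)
    {
        try
        {
            using (var pipe = new NamedPipeClientStream(".", pipeName, PipeDirection.InOut))
            {
                pipe.Connect(timeoutMs); // throws TimeoutException if nobody is listening

                var writer = new StreamWriter(pipe) { AutoFlush = true };
                var reader = new StreamReader(pipe);
                writer.WriteLine("PING");
                // A real probe would also guard this read with a timeout.
                return reader.ReadLine() == "PONG";
            }
        }
        catch (Exception) // timeout, broken pipe, access denied, ...
        {
            return false;
        }
    }
}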
Basically, if you need to determine if a service is able to respond, you need to check if it is responding.
The service controller types and APIs can only provide information on the basis of the service's response to those APIs.
E.g. you can create a service which responds to those APIs correctly but provides no functionality on even-numbered hours.
In the end you need to define "responsive" in terms of the service's functionality (e.g. a batch processor is processing batches) and provide a mechanism (A2A API, WMI, Performance Counters) to surface this.