How to configure MassTransit to retry context.Publish() in case of failure? - c#

How to configure MassTransit to retry context.Publish() before failing, for example when the RabbitMQ server is temporarily unavailable?

The problem with retrying in this context is that the only real reason a Publish call would fail is that the broker connection was lost (for any reason: network, etc.).
In that case, the connection on which the message was received is also lost, meaning another node connected to the broker may have already picked up the message. A retry here would be bad: the publisher would reconnect and send, but the original message could never be acknowledged (since it was likely picked up on another thread/worker).
The usual course of action is to let it fail; when the receive endpoint reconnects, the message is redelivered to a consumer, which then calls Publish and reaches the desired outcome.
You should make sure that your consumer can handle this properly (search for idempotent consumers) to avoid a failure causing a break in your business logic.
Updated Jan 2022: since v7, MassTransit retries all Publish/Send calls until the supplied CancellationToken is canceled.
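Given that v7+ behavior, here is a minimal sketch of bounding those internal retries by passing a CancellationToken to Publish. The message types and the 30-second budget are assumptions, not from the question:

using System;
using System.Threading;
using System.Threading.Tasks;
using MassTransit;

// Hypothetical message types, for illustration only.
public record OrderSubmitted(Guid OrderId);
public record OrderAccepted(Guid OrderId);

public class OrderConsumer : IConsumer<OrderSubmitted>
{
    public async Task Consume(ConsumeContext<OrderSubmitted> context)
    {
        // Give the broker up to 30 seconds to come back; after that the publish
        // (and therefore the consume) fails and the message is redelivered later.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
        await context.Publish(new OrderAccepted(context.Message.OrderId), cts.Token);
    }
}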

Related

Does MassTransit In-Memory Outbox work with Mediator?

Does the In-Memory Outbox only work with an underlying messaging transport configured?
The documentation and some of the posts I have read lead me to believe that it will ONLY work with a specific underlying transport specified. It would be nice if that wasn't the case.
I say this because the discussion I have read around the outbox talks about acknowledging messages "from a broker", and only once all processing has completed successfully are messages acknowledged and publishes performed.
So, when handling the messaging (i.e. via Amazon SQS) yourself and publishing messages into the state machine (i.e. taking the transport message, creating a new message, and then handing it off to a consumer or saga state machine), how would the outbox know about and work with the underlying transport messages?
To be really clear, will the outbox work when using the following configuration (note the absence of any messaging transport configuration):
services.AddMediator(configurator =>
{
    configurator.AddConsumer<PublishMessageConsumer>();
    configurator.AddSagaStateMachine<YetAnotherStateMachine, YetSomeMoreState>(
        sagaConfigurator =>
        {
            sagaConfigurator.UseInMemoryOutbox();
        }).DynamoDbRepository()
    // Snip
});
If it DOES work: if I wanted a consumer AND the saga state machine to work in concert, such that the saga published to the consumer and the consumer failed for some reason, what would actually happen?
The sole purpose of the in-memory outbox is to defer calls to Send/Publish until after the consumer has completed. In the case of a saga, that means after the saga has been persisted to the saga repository, which happens once all state machine behaviors for the event have completed successfully (without throwing an exception).
In the case above, the saga would complete all activities for triggering event, the instance would be saved to the saga repository, and finally the consumer would be created/called by the Send/Publish call from the saga.
If the consumer throws an exception, it won't affect the already persisted saga instance in any way, as that has already completed.
NOW. If you do NOT use the in-memory outbox in this scenario, since it is using mediator (and not a transport), calling Send/Publish in a state machine activity transfers control immediately to the consumer of the message sent/published. After that consumer completes, control returns to the saga; once the activities have completed, the instance is persisted to the repository, the original message consumed by the saga completes, and control returns to the original Send/Publish call.
Mediator is immediate, and any messages produced by consumers and/or sagas are consumed immediately as well.
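For illustration, here is a rough sketch of a state machine activity whose Publish is affected by the outbox. The type shapes are made up, and the syntax assumes the v7-era state machine API (context.Instance rather than context.Saga), so treat it as a sketch rather than drop-in code:

using System;
using Automatonymous;   // MassTransitStateMachine lives here in v7; in v8 it is under MassTransit
using MassTransit;

public class YetSomeMoreState : SagaStateMachineInstance
{
    public Guid CorrelationId { get; set; }
    public string CurrentState { get; set; }
}

public class SomethingHappened : CorrelatedBy<Guid> { public Guid CorrelationId { get; set; } }
public class PublishMessage { public Guid CorrelationId { get; set; } }

public class YetAnotherStateMachine : MassTransitStateMachine<YetSomeMoreState>
{
    public YetAnotherStateMachine()
    {
        InstanceState(x => x.CurrentState);

        Initially(
            When(Happened)
                // With UseInMemoryOutbox: this message is buffered and only handed to
                // PublishMessageConsumer after the instance has been saved to the repository.
                // Without the outbox (plain mediator): the consumer runs inline, right here,
                // before the saga instance is persisted.
                .Publish(context => new PublishMessage { CorrelationId = context.Instance.CorrelationId })
                .TransitionTo(Done));
    }

    public State Done { get; private set; }
    public Event<SomethingHappened> Happened { get; private set; }
}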

NServiceBus, how do you do an action before a message is sent to the error queue?

I am using NServiceBus to create an endpoint.
The endpoint listens to an event, does some calculation, then publishes the result (success or failure) to other endpoints.
I know that NServiceBus supports immediate retries and delayed retries, and that they are configurable.
Now, I want to publish a failure result event to other endpoints after all retries are exhausted (before the message is sent to the error queue).
public async Task Handle(MyEvent message, IMessageHandlerContext context)
{
    Console.WriteLine($"Received MyEvent, ID = {message.Id}");
    // Connect to other services to get data and do some calculation
    Thread.Sleep(1000);
    Console.WriteLine($"Processed MyEvent, ID = {message.Id}");
    await context.Publish(new MyEventResult { IsSucceed = true });
}
Above is my current code. It publishes a successful result if no exception is thrown. But if a fatal exception occurs, I don't know how to publish a failure result event before the message is sent to the error queue.
Thanks in advance.
Notes: I am using NServiceBus 6.4.3
I'm not sure why you want this, but have you looked at NServiceBus sagas? They are intended to be used when you have to do blocking IO via (external) services. You can take an alternative action if a specific task hasn't been performed within an allocated period, or if the returned result was incorrect.
https://docs.particular.net/nservicebus/sagas/
See the following sample of a saga:
https://docs.particular.net/samples/saga/simple/
The following sample shows the use of saga timeouts: if a specific task has not been performed within a specific duration, an alternative action can be performed, like publishing an event or performing a ReplyToOriginator:
https://docs.particular.net/nservicebus/sagas/timeouts
https://docs.particular.net/nservicebus/sagas/reply-replytooriginator-differences
https://docs.particular.net/nservicebus/sagas/#notifying-callers-of-status
By using sagas you are making your process explicit. I would avoid hooking into the recovery mechanism for this.
The recovery mechanism is meant to deal with transient errors like network connectivity issues, database deadlocks, etc. but not with expected failure results. You should properly process these and continue your modeled process in its unhappy path.
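To make that concrete, here is a minimal sketch of such a saga. MyEvent and MyEventResult come from the question (MyEvent.Id is assumed to be a Guid here); the other type names and the five-minute deadline are made up, and the calculation itself is assumed to be handed off elsewhere and reported back via CalculationCompleted:

using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical messages used only for this sketch.
public class CalculationCompleted : IEvent { public Guid Id { get; set; } }
public class CalculationTimedOut { }

public class CalculationSagaData : ContainSagaData
{
    public Guid CalculationId { get; set; }
}

public class CalculationSaga : Saga<CalculationSagaData>,
    IAmStartedByMessages<MyEvent>,
    IHandleMessages<CalculationCompleted>,
    IHandleTimeouts<CalculationTimedOut>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<CalculationSagaData> mapper)
    {
        mapper.ConfigureMapping<MyEvent>(m => m.Id).ToSaga(s => s.CalculationId);
        mapper.ConfigureMapping<CalculationCompleted>(m => m.Id).ToSaga(s => s.CalculationId);
    }

    public async Task Handle(MyEvent message, IMessageHandlerContext context)
    {
        Data.CalculationId = message.Id;
        // Kick off the calculation elsewhere, then set a deadline for the unhappy path
        // instead of relying on the recoverability pipeline.
        await RequestTimeout<CalculationTimedOut>(context, TimeSpan.FromMinutes(5));
    }

    public async Task Handle(CalculationCompleted message, IMessageHandlerContext context)
    {
        await context.Publish(new MyEventResult { IsSucceed = true });
        MarkAsComplete();
    }

    public async Task Timeout(CalculationTimedOut state, IMessageHandlerContext context)
    {
        // The deadline elapsed without a result: publish the failure result explicitly.
        await context.Publish(new MyEventResult { IsSucceed = false });
        MarkAsComplete();
    }
}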

Can I set an explicit task timeout for RabbitMQ consumer?

I understand that RabbitMQ with ack, by default, will re-queue the message if it detects that the consumer/worker has died.
What about the situation where the consumer/worker is still alive but its process has stalled for too long and never acked?
I would like to set an explicit time that says that if a message has been dispatched to a consumer but that consumer has held the message without ack for too long that the message gets re-queued.
I recognize that this might result in messages getting processed in duplicate but sometimes the consequence of that is not as bad as delayed message delivery.
This can also happen with errant exception handling: if something gets swallowed, the task terminates, and the message is never acked and never re-queued.
A timeout for a RabbitMQ consumer can be set explicitly on the consumer side. To be clear, there must not be any automatic acks in this case. The solution is a multithreaded consumer: one thread does the message processing and acks the message only after it has been processed, and the other is a timeout thread that, once the timeout expires, would do one of the following:
- terminate the connection to the broker, with the consequence that the message is requeued,
- ack the received message and re-publish it explicitly, or
- nack the received message; based on the documentation (instructing the broker to either discard them or requeue them), some configuration is needed to tell the broker what to do with nacked messages.
Now, all of this assumes that at least some part of the process isn't stuck. If the whole process is stuck, the broker heartbeat towards the consumer presumably stops and that is how the broker learns that the consumer died (honestly I haven't tested this situation, so I'm assuming). If that is not the case (or simply to be extra safe), you could add some kind of watchdog process that pings the consumer(s) and kills them if there's no reply, which again would leave the messages unacked and cause them to be requeued.
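A rough sketch of the first option (terminate the connection when the work times out), assuming the classic RabbitMQ.Client API with manual acks; the queue name, the 30-second budget, and ProcessMessage are placeholders:

using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (_, ea) =>
{
    // Run the actual work on a separate task so it can be timed out.
    var work = Task.Run(() => ProcessMessage(ea.Body.ToArray()));

    if (work.Wait(TimeSpan.FromSeconds(30)))
    {
        // Finished in time: ack so the broker can discard the message.
        channel.BasicAck(ea.DeliveryTag, multiple: false);
    }
    else
    {
        // Stalled: drop the connection; the unacked message is requeued
        // and will be redelivered (possibly to another consumer).
        connection.Close();
    }
};

channel.BasicConsume(queue: "work-queue", autoAck: false, consumer: consumer);
Console.ReadLine();

static void ProcessMessage(byte[] body)
{
    // The real message handling goes here.
}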

WMQ: Distributing MQ readers over several machines

I am using WMQ to access an IBM WebSphere MQ on a mainframe - using c#.
We are considering spreading out our service on several machines, and we then need to make sure that two services on two different machines cannot read/get the same MQ message at the same time.
My code for getting messages is this:
var connectionProperties = new Hashtable();
const string transport = MQC.TRANSPORT_MQSERIES_CLIENT;
connectionProperties.Add(MQC.TRANSPORT_PROPERTY, transport);
connectionProperties.Add(MQC.HOST_NAME_PROPERTY, mqServerIP);
connectionProperties.Add(MQC.PORT_PROPERTY, mqServerPort);
connectionProperties.Add(MQC.CHANNEL_PROPERTY, mqChannelName);
_mqManager = new MQQueueManager(mqManagerName, connectionProperties);
var queue = _mqManager.AccessQueue(_queueName, MQC.MQOO_INPUT_SHARED + MQC.MQOO_FAIL_IF_QUIESCING);
var queueMessage = new MQMessage {Format = MQC.MQFMT_STRING};
var queueGetMessageOptions = new MQGetMessageOptions {Options = MQC.MQGMO_WAIT, WaitInterval = 2000};
queue.Get(queueMessage, queueGetMessageOptions);
queue.Close();
_mqManager.Commit();
return queueMessage.ReadString(queueMessage.MessageLength);
Is WebSphere MQ transactional by default, or is there something I need to change in my configuration to enable this?
Or - do I need to ask our mainframe guys to do some of their magic?
Thx
Unless you actively BROWSE the message (i.e. read it but leave it there with no locks), only one getter will ever be able to 'get' the message. Even without transactionality, MQ will still only deliver the message once... but once delivered, it's gone.
MQ is not transactional 'by default' - you need to get with MQGMO_SYNCPOINT (MQ transactions) and commit at the connection (MQQueueManager) level if you want transactionality (or integrate with .NET transactions, which is another option).
If you use syncpoint then one getter will get the message and the other will ignore it, but if you subsequently have an issue and roll back, it is made available to any getter again (as you would want). This is the scenario where you might see a message twice, but that's because you aborted the transaction and hence asked for it to be put back to how it was before the get.
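Building on the question's code, here is a minimal sketch of getting under syncpoint and committing only after processing (the option names are from the MQC class; treat this as illustrative rather than drop-in code):

var getOptions = new MQGetMessageOptions
{
    Options = MQC.MQGMO_WAIT | MQC.MQGMO_SYNCPOINT | MQC.MQGMO_FAIL_IF_QUIESCING,
    WaitInterval = 2000
};
var message = new MQMessage { Format = MQC.MQFMT_STRING };

try
{
    queue.Get(message, getOptions);
    var payload = message.ReadString(message.MessageLength);
    // ... process the payload ...
    _mqManager.Commit();      // the message is removed from the queue only now
    return payload;
}
catch (MQException)
{
    _mqManager.Backout();     // the message goes back on the queue for any getter
    throw;
}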
I wish I'd found this sooner because the accepted answer is incomplete. MQ provides once and only once delivery of messages as described in the other answer and IBM's documentation. If you have many clients listening on the same queue, MQ will deliver only one copy of the message. This is uncontested.
That said, MQ, or any other async messaging for that matter, must deal with session handling and ambiguous outcomes. The effect of these factors is that any async messaging application should be designed to gracefully handle duplicate messages.
Consider an application putting a message onto a queue. If the PUT call receives a 2009 Connection Broken response, it is unclear whether the connection failed before or after the channel agent received and acted on the API call. The application, having no way to tell the difference, must put the message again to assure it is received. Doing the PUT under syncpoint can result in a 2009 on the COMMIT (or equivalent return code in messaging transports other than MQ) and the app doesn't know if the COMMIT was successful or if the PUT will eventually be rolled back. To be safe it must PUT the message again.
Now consider the partner application receiving the messages. A GET issued outside of syncpoint that reaches the channel agent will permanently remove the message from the queue, even if the channel agent cannot then deliver it. So use of transacted sessions ensures that the message is not lost. But suppose that the message has been received and processed and the COMMIT returns a 2009 Connection Broken. The app has no way to know whether the message was removed during the COMMIT or will be rolled back and delivered again. At the very least the app can avoid losing messages by using transacted sessions to retrieve them, but can not guarantee to never receive a dupe.
This is of course endemic to all async messaging, not just MQ, which is why the JMS specification directly address it. The situation is addressed in all versions but in the JMS 1.1 spec look in section 4.4.13 Duplicate Production of Messages which states:
If a failure occurs between the time a client commits its work on a Session and the commit method returns, the client cannot determine if the transaction was committed or rolled back. The same ambiguity exists when a failure occurs between the non-transactional send of a PERSISTENT message and the return from the sending method.
It is up to a JMS application to deal with this ambiguity. In some cases, this may cause a client to produce functionally duplicate messages.
A message that is redelivered due to session recovery is not considered a duplicate message.
If it is critical that the application receive one and only one copy of the message, use 2-Phase transactions. The transaction manager and XA protocol will provide very strong (but still not absolute) assurance that only one copy of the message will be processed by the application.
The behavior of the messaging transport in delivering one and only one copy of a given message is a measure of the reliability of the transport. By contrast, the behavior of an application which relies on receipt of one and only one copy of the message is a measure of the reliability of the application.
Any duplicate messages received from an IBM MQ transport are almost certainly going to be due to the application's failure to use XA to account for the ambiguous outcomes inherent in async messaging and not a defect in MQ. Please keep this in mind when the Production version of the application chokes on its first duplicate message.
On a related note, if Disaster Recovery is involved, the app must also gracefully recover from lost messages, or else find a way to violate the laws of relativity.
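As one small illustration of what "gracefully handle dupe messages" can mean at the application level (this is application bookkeeping, not an MQ feature), the getter from the question could record each MessageId it has processed and skip repeats; the in-memory set below stands in for a durable store shared by all instances, and HandlePayload is a placeholder:

var processedIds = new HashSet<string>();   // in real life: a database table or similar

queue.Get(queueMessage, queueGetMessageOptions);
var messageId = BitConverter.ToString(queueMessage.MessageId);

if (processedIds.Add(messageId))
{
    // First time this message has been seen: do the real work.
    HandlePayload(queueMessage.ReadString(queueMessage.MessageLength));
}
// Commit either way so the (possibly duplicate) message is removed.
_mqManager.Commit();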

nservicebus + webhooks +Errors +MaxRetries

Feature Description
The NServiceBus gateway, http://docs.particular.net/nservicebus/gateway/, seems to be a way to achieve an internal webhook using the NServiceBus infrastructure.
We need to go further with this concept and open up a few events to any 3rd-party subscriber that has access to register a webhook URL in our system.
Review
We plan to create two initial Windows services:
1) WebHookBatchService, which can be added as a subscriber to specific messages of interest.
<UnicastBusConfig>
  <MessageEndpointMappings>
    .......
    <add Messages="MyMessages.MyImportantMessage, MyMessages" Endpoint="WebHookBatchService.Queue"/>
    .......
  </MessageEndpointMappings>
</UnicastBusConfig>
2) WebHookProcessService - actually processes 1 message sent by the WebHookBatchService.
Once messages are received on WebHookBatchService.Queue, our WebHookBatchService will look up all the subscribers for the specific tenant + message type and, for each one, send an individual message to WebHookProcessService.Queue for the WebHookProcessService (we can put an instance of the NServiceBus load balancer in between to bridge the batcher and the actual processor) to actually process the real messages, probably using http://restsharp.org/.
Questions
Are there any existing open source projects that do this today?
Now, since we have no control over the durability of the subscribers, how should we manage errors?
http://wiki.shopify.com/WebHook
A webhook will be deleted if there are 19 consecutive failures for the exact same webhook.
It doesn't mention any delay between webhook retries. What have people experienced as a standard delay in retry logic?
Here are some other thoughts:
proposal 0: MaxRetries="1". Purge WebHookProcessService.ErrorQueue nightly. (no retry - guaranteed message loss if it fails the first time)
proposal 1:
MaxRetries="1" on exception catch send email containing xml version of the message that would have been delivered over http.
Purge WebHookProcessService.ErrorQueue nightly.
-- I see potential spam issues.
proposal 2: The NServiceBus MaxRetries mechanism retries right away, without delay. So I would need to create (1hr - 24hr) bucket queues and use a RetrySchedulerService, although I see this as difficult to maintain and confusing for subscribers when, once their service endpoint starts working again, they receive 25 messages all at once in a non-DateCreated-ordered fashion.
Digging for ideas...
The Gateway is typically used for communication between physical sites over HTTP. Since you are exposing an endpoint to the world to accept callbacks, I'm thinking you could just use the built-in WCF hosting and expose your endpoint through the firewall to 3rd parties. The rest of your setup sounds appropriate to me.
As for errors, you are correct: NSB retries immediately, but if you are using web callbacks this may get you by in the cases where there are only small hiccups. You will need to determine how you want to process the error queues; we just built a new endpoint to process the error queues, with logic to determine the retries, delay, etc. A nice way to accomplish this is to use a saga, which includes a timeout manager. This enables a workflow where you can retry a specified number of times, try another communication channel, log everything, and ultimately notify someone who can contact the 3rd party to let them know their stuff is busted.
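A rough sketch of that error-processing saga (all type names, the attempt limit, and the one-hour delay are made up, and it uses the modern NServiceBus saga/timeout API rather than the version the question targets):

using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical messages for this sketch.
public class DeliverWebHook : ICommand
{
    public Guid WebHookId { get; set; }
    public string Url { get; set; }
    public string Payload { get; set; }
}
public class RetryDelivery { }

public class WebHookDeliverySagaData : ContainSagaData
{
    public Guid WebHookId { get; set; }
    public string Url { get; set; }
    public string Payload { get; set; }
    public int Attempts { get; set; }
}

public class WebHookDeliverySaga : Saga<WebHookDeliverySagaData>,
    IAmStartedByMessages<DeliverWebHook>,
    IHandleTimeouts<RetryDelivery>
{
    const int MaxAttempts = 5;

    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<WebHookDeliverySagaData> mapper)
    {
        mapper.ConfigureMapping<DeliverWebHook>(m => m.WebHookId).ToSaga(s => s.WebHookId);
    }

    public Task Handle(DeliverWebHook message, IMessageHandlerContext context)
    {
        Data.WebHookId = message.WebHookId;
        Data.Url = message.Url;
        Data.Payload = message.Payload;
        return TryDeliver(context);
    }

    public Task Timeout(RetryDelivery state, IMessageHandlerContext context) => TryDeliver(context);

    async Task TryDeliver(IMessageHandlerContext context)
    {
        Data.Attempts++;

        if (await PostToSubscriber(Data.Url, Data.Payload))
        {
            MarkAsComplete();
            return;
        }

        if (Data.Attempts >= MaxAttempts)
        {
            // Give up: log it and notify someone who can contact the 3rd party.
            MarkAsComplete();
            return;
        }

        // Back off before the next attempt instead of retrying immediately.
        await RequestTimeout<RetryDelivery>(context, TimeSpan.FromHours(1));
    }

    static Task<bool> PostToSubscriber(string url, string payload)
    {
        // Placeholder for the real HTTP POST (e.g. via RestSharp or HttpClient).
        return Task.FromResult(false);
    }
}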

Categories