Kafka - C# - confluent-kafka-dotnet - Message time out

We have a simple standalone-mode deployment of Kafka 1.1.0 on our Linux machine. In server.properties we have modified:
listeners = PLAINTEXT://10.0.5.66:9092
advertised.listeners is commented out, so it falls back to the value of the listeners property.
We are using a .NET (C#) producer which pushes messages through confluent-kafka-dotnet (0.11.4).
Sometimes the message gets transferred to Kafka, and sometimes we receive a "Message time out" error on the producer side.
We are running out of ideas about what might cause this. It happens from time to time; if one message fails, another sent a few seconds later usually goes through.
Another clue is that from time to time we see the following message in the Kafka logs on the server: WARN: Attempting to send a response via a channel for which there is no open connection <IP:PORT>. This message sometimes contains the IP address and port of the producer.
Any idea of what might be wrong?
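For context, this is roughly how the producer side is wired up with confluent-kafka-dotnet 0.11.x. It is a sketch rather than our exact code: the topic name is a placeholder, and the explicit message.timeout.ms setting and the delivery-report check are only there to show where the "Message time out" error surfaces.
using System;
using System.Collections.Generic;
using System.Text;
using Confluent.Kafka;
using Confluent.Kafka.Serialization;

var config = new Dictionary<string, object>
{
    { "bootstrap.servers", "10.0.5.66:9092" },
    // How long librdkafka keeps retrying before reporting the message as timed out.
    { "message.timeout.ms", 30000 }
};

using (var producer = new Producer<Null, string>(config, null, new StringSerializer(Encoding.UTF8)))
{
    var report = producer.ProduceAsync("my-topic", null, "some payload").Result;
    if (report.Error.HasError)
    {
        // This is where the intermittent "Message time out" shows up on our side.
        Console.WriteLine("Delivery failed: " + report.Error.Reason);
    }
}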

Related

The format of the specified network name is not valid : HTTPListener error on system restart

I have implemented an HttpListener in C# as a Windows service. The Windows service is set to start automatically when the machine is restarted. When I manually start the service after installing it, the HTTP listener works fine and responds to the requests it receives. But when the service is started on a system restart, I get the following error:
System.Net.HttpListenerException (0x80004005): The format of the specified network name is not valid
I get this error on listener.Start().
The code of http listener is like this:
HttpListener listener = new HttpListener();
listener.Prefixes.Add("http://myip:port/");
listener.Start();
I got a suggestion from this already asked question. If I follow what's given in the answer, it still doesn't work.
Furthermore, I tried running:
netsh http show iplisten
in PowerShell, the list is empty. Even when the HTTP listener works (the first time I install and run the service), the output of this command is an empty list, so I don't think this is the issue.
Any suggestions will be really helpful.
Answering my own question: it seems there are some other services that need to be running before an HTTP listener can be started, and they are not yet started by the time Windows starts my service. I found two solutions for this. One is to use delayed start:
sc.exe config myservicename start= delayed-auto
The other is to wrap the listener start in a try/catch and, if it fails, try again after a few seconds. In my case time is of the essence, so I'm using the second approach because it starts the listener about 2 minutes faster than the first.
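For reference, the retry variant looks roughly like this; the 10-attempt limit and the 5-second delay are arbitrary values for the sketch, not something prescribed by HttpListener:
using System;
using System.Net;

HttpListener listener = new HttpListener();
listener.Prefixes.Add("http://myip:port/");

// The networking services the listener depends on may not be up yet
// right after a reboot, so keep retrying Start() for a little while.
for (int attempt = 1; ; attempt++)
{
    try
    {
        listener.Start();
        break;
    }
    catch (HttpListenerException) when (attempt < 10)
    {
        System.Threading.Thread.Sleep(TimeSpan.FromSeconds(5));
    }
}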

Detect Lost Connection In Publish And Change Host RabbitMQ

I am developing a very simple app using RabbitMQ: one machine, multiple queues and exchanges, one publisher and one consumer. After reading about clustering and HA, I connected a second machine to create a cluster, and I mirrored the queues so there is at least one replica. Now when I want to publish some data to a queue, I use the first machine as my host and it works fine, but if the RabbitMQ service on the first machine is not running, my app crashes. My question is: how do I know which machine is up when creating the connection, and how do I change the host while publishing messages?
UPDATE: I use one of the CreateConnection overloads to pass all my hosts when creating a connection. OK, this solves the problem of finding an available machine to create a connection. But the second question is still there; look at the code below:
for (int i = 0; i < 300; i++)
{
    var message = string.Format("Message #{0}: {1}", i, Guid.NewGuid());
    var messageBodyBytes = Encoding.UTF8.GetBytes(message);
    channel.BasicPublish(ExchangeName, "123456", null, messageBodyBytes);
}
These lines of code work perfectly when the connection is OK, but assume that in the middle of publishing messages to an exchange the service stops unexpectedly. In that case a System.IO.FileLoadException is raised first, and if I continue execution a RabbitMQ.Client.Exceptions.AlreadyClosedException is raised, which says:
Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=320, text="CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'", classId=0, methodId=0, cause=
I think there must be a way to change the host when the connection is closed during publishing, but I have no idea how!
You must close the channel and the current connection and open a new one of each mid-loop, so that the reconnect can go to a different host. You only have to do this when the exception is caught, not on every iteration of the loop.
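As a rough sketch of that idea (the host names are placeholders, ExchangeName comes from the question, and the CreateConnection overload that takes a list of host names picks whichever broker is reachable):
using System;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

var factory = new ConnectionFactory();
var hostnames = new[] { "machine1", "machine2" };

var connection = factory.CreateConnection(hostnames);
var channel = connection.CreateModel();

for (int i = 0; i < 300; i++)
{
    var message = string.Format("Message #{0}: {1}", i, Guid.NewGuid());
    var messageBodyBytes = Encoding.UTF8.GetBytes(message);
    try
    {
        channel.BasicPublish(ExchangeName, "123456", null, messageBodyBytes);
    }
    catch (AlreadyClosedException)
    {
        // The broker we were connected to went down mid-loop: throw away the dead
        // channel/connection, reconnect (the factory tries all listed hosts) and
        // publish the current message again on the new channel.
        try { channel.Dispose(); connection.Dispose(); } catch { /* already torn down */ }
        connection = factory.CreateConnection(hostnames);
        channel = connection.CreateModel();
        channel.BasicPublish(ExchangeName, "123456", null, messageBodyBytes);
    }
}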
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

SignalR Groups.Add times out and fails

I'm trying to add a member to a Group using SignalR 2.2. Every single time, I hit a 30 second timeout and get a "System.Threading.Tasks.TaskCanceledException: A task was canceled." error.
From a GroupSubscriptionController that I've written, I'm calling:
var hubContext = GlobalHost.ConnectionManager.GetHubContext<ProjectHub>();
await hubContext.Groups.Add(connectionId, groupName);
I've found this issue where people are periodically encountering this, but it happens to me every single time. I'm running the backend (ASP.NET 4.5) on one VS2015 launched localhost port, and the frontend (AngularJS SPA) on another VS 2015 launched localhost port.
I had gotten SignalR working to the point where messages were being broadcast to every connected client. It seemed so easy. Now, adding in the Groups part (so that people only get select messages from the server) has me pulling my hair out...
That task cancellation error could be thrown because the connectionId can't be found in SignalR's registry of connected clients.
How are you getting this connectionId? You have multiple servers/ports going - is it possible that you're getting your wires crossed?
I know there is an accepted answer to this, but I came across this once for a different reason.
First off, do you know what Groups.Add does?
I had expected Groups.Add's task to complete almost immediately every time, but not so. Groups.Add returns a task that only completes when the client (i.e. the JavaScript side) acknowledges that it has been added to the group - this is useful on reconnect, so the client can resubscribe to all its old groups. Note that this acknowledgement is not visible to developer code and is nicely covered up for you.
The problem is that the client may not respond because it has disconnected (i.e. it has navigated to another page). This means the await has to wait until the connection is considered disconnected (default timeout 30 seconds) before giving up and throwing a TaskCanceledException.
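If you cannot rule out that the client has already gone away, one workaround is to put your own bound on the wait instead of relying on the 30 second disconnect timeout. A rough sketch (the 5 second limit is an arbitrary choice for the example, not a SignalR setting):
var hubContext = GlobalHost.ConnectionManager.GetHubContext<ProjectHub>();
var addTask = hubContext.Groups.Add(connectionId, groupName);

// Wait for the client's acknowledgement, but only for a bounded amount of time.
var finished = await Task.WhenAny(addTask, Task.Delay(TimeSpan.FromSeconds(5)));
if (finished != addTask)
{
    // No acknowledgement arrived - the client has most likely disconnected or
    // navigated away, so treat the group subscription as failed instead of
    // waiting for the full disconnect timeout and the TaskCanceledException.
}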
See http://www.asp.net/signalr/overview/guide-to-the-api/working-with-groups for more detail on groups

MSMQ: Messages occasionally not being sent or received without error

I have the following setup and problem with MSMQ. Based on previous experience with MSMQ I'm betting that it is something simple I'm missing but I just don't know what it is.
The Setup
I have 3 load-balanced web servers (let's call them W1, W2 and W3) and 1 server which processes certain events/data away from web requests (which I'll call P). Once a particular event occurs within the web application, each of the 3 web servers sends a message to a remote private queue on Server P, which then processes each message from the queue and carries out some task.
The Problem
For the most part - at a guess 95% of the time - everything runs fine, but occasionally Server P does not receive messages from the web servers. Either W1, W2 or W3 is not sending them, or they are not being received by P; I just can't tell. This means I'm missing vital events from users of the web application, but I cannot find any errors in my own logs.
The Details
Here are all the details I can think of which may help explain my setup and what I've figured out so far:
The private queue on Server P is non-transactional.
The private queue has permissions setup for Everyone to both Send and Receive Messages.
This is the code I use (C#) to send the message to the remote private queue:
var queue = new MessageQueue(@"FormatName:DIRECT=OS:ServerP\PRIVATE$\MyMessageQueue");
var defaultProperties = queue.DefaultPropertiesToSend;
defaultProperties.AcknowledgeType = AcknowledgeTypes.FullReachQueue | AcknowledgeTypes.FullReceive;
defaultProperties.Recoverable = true;
defaultProperties.UseDeadLetterQueue = true;
defaultProperties.UseJournalQueue = true;
queue.Send(requestData);
Sending the message using the code above does not appear to throw an exception - if it did my error handler in the web application would have caught and logged it, so I'm assuming it is sent.
There are outgoing queues on W1, W2 and W3 all pointing to the private queue on P - all these are empty.
On W1, W2 and W3 I cannot see any "dead-letter" messages.
On P the private queue is empty so messages are being processed (which I can verify from my database).
On P there are no "dead-letter" messages. There are journal messages but they don't seem to correspond to any recent date/times.
All servers are running Windows Server 2012.
Most of the time messages are sent, received and processed just fine but, without any pattern visible to me, sometimes they are not. Can anyone see what is going wrong? Or explain to me how I can try and figure out what is happening?
Are you sure that the receiver on P does not crash/lose the message somehow? Because your queue is not transactional, if somehow processing fails then that's one lost message.
Anyway, there are many possible causes why this could fail.
What kind of logging do you have (DEBUG/INFO levels)?
I think logging at the following points will help track down the issue:
When an event is generated in the web app.
Right before you send an event from the web app, via MSMQ.
In the receiver when you get a message from the queue.
This way you could at least match sent messages to received messages and to processed messages.
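As an illustration of those three points, the sender side could log something like the following; the Label-based correlation id, the _log object and the receiver's processingQueue are assumptions for the sketch, not part of the original code:
using System;
using System.Messaging;

// 1. When the event is generated in the web app:
var correlationId = Guid.NewGuid().ToString();
_log.Info("Event {0} generated, queuing it for Server P", correlationId);

// 2. Right before sending it via MSMQ, using the same queue object as in the question:
var message = new Message(requestData) { Label = correlationId };
_log.Info("Sending event {0} to {1}", correlationId, queue.Path);
queue.Send(message);

// 3. In the receiver on Server P (processingQueue being whatever MessageQueue it already uses):
var received = processingQueue.Receive();
_log.Info("Received event {0}", received.Label);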
As a side note, when you check for dead-letter messages you do so on the source computer and on any intermediary hops, not on the destination one. If you don't have any hops, then they will be relayed to the non-transactional dead-letter queue on the web servers.

Rebus stops retrieving messages from RabbitMQ

We have an issue in our Rebus/RabbitMQ setup where Rebus suddenly stops retrieving/handling messages from RabbitMQ. This has happened two times in the last month and we're not really sure how to proceed.
Our RabbitMQ setup has two nodes on different servers, and the Rebus side is a windows service.
We see no errors in Rebus or in the eventlog on the server where Rebus runs. We also do not see errors on the RabbitMQ servers.
Rebus (and the Windows service) keeps running, as we do see other log messages, like the DueTimeOutSchedular and timeout replies. However, it seems the worker thread stops running, without any errors being logged.
The result is a RabbitMQ input queue that keeps growing :( We're adding logging to monitor this so we get notified if it happens again.
But I'm looking for advice on how to continue the investigation, and for ideas on how to prevent this. Maybe some of you have experienced this before?
UPDATE
It seems that we actually did have a node crash, at least the last time it happened. The master RabbitMQ node crashed (the server crashed) and the slave was promoted to master. As far as I can see from the RabbitMQ logs on the nodes, everything went according to plan. There are no other errors in the RabbitMQ logs.
At the time this happened, Rebus was configured to connect only to the node that was the slave (then promoted to master), so Rebus did not experience the RabbitMQ failure and thus reported no connection errors. However, it seems that Rebus stopped handling messages when the failure occurred.
We are actually experiencing this on a few queues, it seems, and some of them, but not all, seem to have ended up in an unsynchronized state.
UPDATE 2
I was able to reproduce the problem quite easily, so it might be a configuration issue in our setup. This is what we do to reproduce it:
Start two nodes in a cluster, ex. rabbit1 (master) and rabbit2 (slave)
Rebus connects to rabbit2, the slave
Close rabbit1, the master. rabbit2 is promoted to master
The queues are mirrored
We have two small test apps to reproduce this: a "sender" that sends a message every second and a "consumer" that handles the messages.
When rabbit1 is closed, the "consumer" stops handling messages, but the "sender" keeps sending the messages and the queue keeps growing.
Start rabbit1 again, it joins as slave
This has no effect and the "consumer" still does not handle messages.
Restart the "consumer" app
When the "consumer" is restarted it retrieves all the messages and handles them.
I think I have followed the setup guides correctly, but it might be a configuration issue on our part. I can't seem to find anything that would suggest what we have done wrong.
Rebus is still connected to RabbitMQ - we can see that in the connections tab on the management site - but the "consumer's" send/received B/s drops to about 2 B/s when it stops handling messages.
UPDATE 3
OK, so I downloaded the Rebus source and attached to our process so I could see what happens in the "RabbitMqMessageQueue" class when it stops. When rabbit1 is closed, the "BasicDeliverEventArgs" is null; this is the code:
BasicDeliverEventArgs ea;
if (!threadBoundSubscription.Next((int)BackoffTime.TotalMilliseconds, out ea))
{
return null;
}
// wtf??
if (ea == null)
{
return null;
}
See: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L178
I like the "wtf ??" comment :)
That sounds very weird!
Whenever Rebus' RabbitMQ transport experiences an error on the connection, it will throw out the connection, wait a few seconds, and ensure that the connection is re-established again when it can.
You can see the relevant place in the source here: https://github.com/rebus-org/Rebus/blob/master/src/Rebus.RabbitMQ/RabbitMqMessageQueue.cs#L205
So I guess the question is whether the RabbitMQ client library can somehow enter a faulted state, silently, without throwing an exception when Rebus attempts to get the next message...?
When you experienced the error, did you check out the 'connections' tab in RabbitMQ management UI and see if the client was still connected?
Update:
Thanks for your thorough investigation :)
The "wtf??" is in there because I once experienced a hiccup when ea had apparently been null, which was unexpected at the time, thus causing a NullReferenceException later on and the vomiting of exceptions all over my logs.
According to the docs, Next will return true and set the result to null when it reaches "end-of-stream", which is apparently what happens when the underlying model is closed.
The correct behavior in that case for Rebus would be to throw a proper exception and let the connection be re-established - I'll implement that right away!
Sit tight, I'll have a fix ready for you in a few minutes!
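For illustration only, the fix described above could look roughly like this in the excerpt quoted earlier; this is a sketch of the idea, not the actual commit, and InvalidOperationException is just a placeholder exception type:
BasicDeliverEventArgs ea;
if (!threadBoundSubscription.Next((int)BackoffTime.TotalMilliseconds, out ea))
{
    // Timed out waiting for a delivery - nothing to receive this time around.
    return null;
}

// Next returned true but gave us no delivery: per the client docs this means
// end-of-stream, i.e. the underlying model has been closed. Throwing here lets
// Rebus' error handling tear the connection down and re-establish it, instead
// of silently looping on a dead subscription.
if (ea == null)
{
    throw new InvalidOperationException(
        "Subscription returned null BasicDeliverEventArgs - the underlying model appears to be closed");
}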
