Transport Endpoint Not Connected - Mesos Slave/Master

I had a similar problem.
My slave logs would be filled with

    E0812 15:58:04.017990  2193 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]

My master would have

    F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

The master would then die and a new election would occur. The killed master would be restarted by upstart (I am on a CentOS 6 box) and added back into the pool of potential masters, so my elected master would daisy-chain around my master nodes. Restarting the masters and slaves many times did nothing; the problem consistently returned within 1 minute of a master election.

The solution for me came from this Stack Overflow question (thanks) and a hint in a GitHub gist note.

The gist of it is that /etc/default/mesos-master must specify a quorum number, and it needs to be correct for the number of Mesos masters (in my case 3):

    MESOS_QUORUM=2

This seems odd to me, as I have the same information in the file /etc/mesos-master/quorum.

But I added it to /etc/default/mesos-master, restarted the mesos-masters and slaves, and the problem has not returned.
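For reference, the quorum is meant to be a strict majority of the masters, which is why 3 masters give a quorum of 2. A minimal sketch of that arithmetic follows; `quorum_for` is just a helper name made up for illustration, not a Mesos tool.

```shell
# Smallest strict majority of n masters: floor(n / 2) + 1.
# quorum_for is a hypothetical helper, not part of Mesos.
quorum_for() {
  echo $(( $1 / 2 + 1 ))
}

quorum_for 3   # prints 2
quorum_for 5   # prints 3
```

The result is the value that would go into MESOS_QUORUM for a cluster of that size.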

I hope this helps you.

mesos slaves are not connecting with mesos masters cluster

So the problematic line is:

    Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093  2502 slave.cpp:3215] master@127.0.0.1:5050 exited

Specifically, note that it is detecting the master as having the IP address 127.0.0.1. The Mesos agent[1] sees that IP address and tries to connect, which fails because the master isn't running on the same machine as the agent.

This happens because the master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that in ZooKeeper. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, and --ip_discovery_command, as well as the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.

The best thing you can do to make sure things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a reverse-DNS lookup of the box's hostname in order to figure out what IP people will contact it from.

If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which should cause Mesos to announce exactly what you want.
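A quick way to see what the hostname resolves to before starting the master, as a rough sketch. The helper names `is_loopback` and `check_host_ip` are made up for illustration, and this assumes a Linux box where `getent` is available. It only approximates the lookup Mesos performs, so treat the output as a hint rather than proof.

```shell
# Return success if the given address is a loopback address.
# (hypothetical helper, not part of Mesos)
is_loopback() {
  case "$1" in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve this machine's hostname and report whether it maps to loopback.
# (hypothetical helper; assumes getent and awk are installed)
check_host_ip() {
  host_name=$(hostname -f 2>/dev/null || hostname)
  ip=$(getent hosts "$host_name" | awk '{print $1; exit}')
  if [ -z "$ip" ]; then
    echo "no address found for $host_name"
  elif is_loopback "$ip"; then
    echo "$host_name resolves to loopback ($ip)"
  else
    echo "$host_name resolves to $ip"
  fi
}
```

Running `check_host_ip` on the master and seeing a loopback address would be consistent with the `master@127.0.0.1:5050` line in the agent log, and would suggest fixing name resolution or setting --ip explicitly.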

[1] The Mesos slave has been renamed to agent; see https://issues.apache.org/jira/browse/MESOS-1478

Mesos agent always in Deactivated state

I believe you did not set the master IP correctly; the following commands are correct. FYI, if you use ZooKeeper, you cannot use 127.0.0.1 either.

master:

    mesos-master --ip=192.168.201.131 --work_dir=/tmp/mesos

agent:

    mesos-agent --ip=192.168.201.128 --master=192.168.201.131:5050 --work_dir=/tmp/mesos

mesos-master cannot find mesos-slave, and elects a new leader at short intervals

Thanks to Joseph Wu for helping me solve the problem. The details:

There are two repeating log messages that tell you (indirectly) that something is wrong:

    I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050

This message means that you've started this master before, with the same work directory. It has some sort of persistent state in its work directory.

This log message tells you that there are two masters you have not started before:

    I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received

The masters will refuse to start because fewer than a quorum of masters have the persistent state. If the masters were to start anyway, you could lose data. This is the expected behavior, as Mesos errs on the side of caution.

If I need a fresh Mesos cluster, I need a clean work directory for the master.
However, the problem was not just on 10.142.55.202, as Joseph Wu suggested. I cleared every work_dir, and that got me out of this problem.

How to clean the work dir:

Find the mesos-master work dir:

    $ cat /etc/mesos-master/work_dir
    /var/lib/mesos

Remove it:
```
$ rm -rf /var/lib/mesos
```
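If deleting the directory outright feels too risky, a more cautious variant is to move it aside first. This is only a sketch: `archive_work_dir` is a made-up helper name, the `.bak.<timestamp>` suffix is an arbitrary choice, and the master should be stopped before running it.

```shell
# Move a Mesos master work directory aside instead of deleting it.
# Prints the backup path on success.
# (hypothetical helper, not part of Mesos)
archive_work_dir() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  # Strip any trailing slash, then append a timestamped suffix.
  backup="${dir%/}.bak.$(date +%Y%m%d%H%M%S)"
  mv "$dir" "$backup" && echo "$backup"
}
```

For example, `archive_work_dir /var/lib/mesos` would leave the old state in a sibling directory, which can be removed once the cluster is healthy again.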
