Transport Endpoint Not Connected - Mesos Slave / Master
I had a similar problem.
My slave logs would be filled with
E0812 15:58:04.017990 2193 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]
My master would have
F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
The master would then die and a new election would occur. The killed master would be restarted by upstart (I am on a CentOS 6 box) and added back into the pool of potential masters, so my elected master would daisy-chain around my master nodes. Restarting the masters and slaves many times did nothing; the problem would consistently return within 1 minute of master election.
The solution for me came from this Stack Overflow question (thanks) and a hint in a GitHub gist note.
The gist of it is that /etc/default/mesos-master must specify a quorum number. It needs to be correct for the number of Mesos masters; in my case, with 3 masters:
MESOS_QUORUM=2
This seems odd to me, as I have the same information in the file /etc/mesos-master/quorum. But I added it to /etc/default/mesos-master, restarted the mesos-masters and slaves, and the problem has not returned.
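As a rough sketch of how the quorum value relates to the number of masters: the quorum is a strict majority of the masters. The helper name `quorum_for` below is hypothetical and is not part of Mesos; it only illustrates the arithmetic.

```shell
# quorum_for is a hypothetical helper, not a Mesos command.
# It prints the smallest strict majority for a given number of masters.
quorum_for() {
    masters=$1
    echo $(( masters / 2 + 1 ))
}

# With 3 masters this gives the value used for MESOS_QUORUM above.
quorum_for 3
```

With 5 masters the same calculation gives 3.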
I hope this helps you.
mesos slaves are not connecting with mesos masters cluster
So the problematic line is:
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Specifically, note that it is detecting the master as having the IP address 127.0.0.1. The Mesos agent[1] sees that IP address and tries to connect, which fails because the master isn't running on the same machine as the agent.
This happens because the master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that in ZooKeeper. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, and --ip_discovery_command, as well as the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.
The best thing you can do to make sure things work out of the box is to make sure the machines have resolvable hostnames. Mesos does a DNS lookup of the box's hostname in order to figure out what IP people will contact it from.
If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which should cause Mesos to announce exactly what you want.
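One way to check for this situation before starting the master is to see what the machine's own hostname resolves to. The following is only a sketch: `is_loopback` is a hypothetical helper, and `getent hosts` is used as an approximation of the lookup, on the assumption that it is available on the box; Mesos itself may resolve the name differently.

```shell
# is_loopback is a hypothetical helper: it succeeds for 127.x.x.x or ::1.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# Look up this machine's hostname (getent consults /etc/hosts and DNS).
resolved_ip=$(getent hosts "$(hostname)" | awk '{ print $1; exit }')

if [ -z "$resolved_ip" ]; then
    echo "hostname does not resolve; consider setting --hostname and --ip"
elif is_loopback "$resolved_ip"; then
    echo "hostname resolves to loopback ($resolved_ip); set --ip explicitly"
else
    echo "hostname resolves to $resolved_ip"
fi
```

If the second message is printed, other machines would likely be told to contact the master at a loopback address, which matches the master@127.0.0.1 line in the log above.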
[1] The Mesos slave has been renamed to agent; see: https://issues.apache.org/jira/browse/MESOS-1478
Mesos agent always in Deactivated state
I believe you did not set the master IP correctly; the following commands are correct. FYI, if you use ZooKeeper, you cannot use 127.0.0.1 either.
master
mesos-master --ip=192.168.201.131 --work_dir=/tmp/mesos
agent
mesos-agent --ip=192.168.201.128 --master=192.168.201.131:5050 --work_dir=/tmp/mesos
mesos-master cannot find mesos-slave, and elects a new leader at short intervals
Thanks to Joseph Wu for helping me solve the problem. The details:
There are two repeating log messages that tell you (indirectly) that something is wrong:
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050
This message means that you've started this master before, with the same work directory. It has some sort of persistent state in its work directory.
This log message tells you that there are two masters you have not started before:
I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
The masters will refuse to start because there is less than a quorum of masters with the persistent state. If the masters were to start, you would have potential data loss. This is the expected behavior, as Mesos errs on the side of caution.
If I need a fresh Mesos cluster, I need a clean work directory for the master.
But the problem is not on 10.142.55.202, as Joseph Wu says. I cleared the work_dir on all the masters, and that got me out of this problem.
How to clean the work dir:
Find the mesos-master work dir:
$ cat /etc/mesos-master/work_dir
/var/lib/mesos
Remove it:
$ rm -rf /var/lib/mesos
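If you would rather keep the old state around in case you need it, a more cautious variant is to rename the work dir instead of deleting it. This is only a sketch: `archive_work_dir` is a hypothetical helper, and it assumes the master has been stopped first.

```shell
# archive_work_dir is a hypothetical helper: it renames the directory
# with a timestamp suffix instead of deleting it, and prints the new path.
archive_work_dir() {
    dir=$1
    backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && echo "$backup"
}

# Usage (with the path reported by /etc/mesos-master/work_dir):
# archive_work_dir /var/lib/mesos
```

The master should then start with an empty work dir, the same as after rm -rf, and the renamed copy can be deleted later.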