Linux Tcp Connect with Select() Fails at Testserver

Linux TCP connect with Select() fails at testserver

with nonblocking socket connect() call may return 0 with the connection is still not ready
the connect() code section, may be written like this(my connect wraper code segment learnt from the python implementation):

    if (FAIL_CHECK(connect(sock, (struct sockaddr *) &channel, sizeof(channel)) &&
            errno != EINPROGRESS))
    {
        gko_log(WARNING, "connect error");
        ret = HOST_DOWN_FAIL;
        goto CONNECT_END;
    }

    /** Wait for write bit to be set **/
#if HAVE_POLL
    {
        struct pollfd pollfd;

        pollfd.fd = sock;
        pollfd.events = POLLOUT;

        /* send_sec is in seconds, timeout in ms */
        select_ret = poll(&pollfd, 1, (int)(send_sec * 1000 + 1));
    }
#else
    {
        FD_ZERO(&wset);
        FD_SET(sock, &wset);
        select_ret = select(sock + 1, 0, &wset, 0, &send_timeout);
    }
#endif /* HAVE_POLL */
    if (select_ret < 0)
    {
        gko_log(FATAL, "select/poll error on connect");
        ret = HOST_DOWN_FAIL;
        goto CONNECT_END;
    }
    if (!select_ret)
    {
        gko_log(FATAL, "connect timeout on connect");
        ret = HOST_DOWN_FAIL;
        goto CONNECT_END;
    }

python version code segment:

res = connect(s->sock_fd, addr, addrlen);
if (s->sock_timeout > 0.0) {
    if (res < 0 && errno == EINPROGRESS && IS_SELECTABLE(s)) {
        timeout = internal_select(s, 1);
        if (timeout == 0) {
            /* Bug #1019808: in case of an EINPROGRESS,
               use getsockopt(SO_ERROR) to get the real
               error. */
            socklen_t res_size = sizeof res;
            (void)getsockopt(s->sock_fd, SOL_SOCKET,
                             SO_ERROR, &res, &res_size);
            if (res == EISCONN)
                res = 0;
            errno = res;
        }
        else if (timeout == -1) {
            res = errno;            /* had error */
        }
        else
            res = EWOULDBLOCK;                      /* timed out */
    }
}

if (res < 0)
    res = errno;

Connect Timeout with Alarm()

signal is a massively under-specified interface and should be avoided in new code. On some versions of Linux, I believe it provides "BSD semantics", which means (among other things) that providing SA_RESTART by default.

Use sigaction instead, do not specify SA_RESTART, and you should be good to go.

...

Well, except for the general fragility and unavoidable race conditions, that is. connect will return EINTR for any signal, not just SIGALARM. More troublesome, if the system happens to be under heavy load, it could take more than 5 seconds between the call to alarm and the call to connect, in which case you will miss the signal and block in connect forever.

Your earlier attempt, using non-blocking sockets with connect and select, was a much better idea. I would suggest debugging that.

Synchronizing a Test Server During Tests

You can just attempt to connect to the server before starting the test suite, as part of the initialization process.

For example, I usually have a function like this in my tests:

// waitForServer attempts to establish a TCP connection to localhost:<port>
// in a given amount of time. It returns upon a successful connection; 
// ptherwise exits with an error.
func waitForServer(port string) {
    backoff := 50 * time.Millisecond

    for i := 0; i < 10; i++ {
        conn, err := net.DialTimeout("tcp", ":"+port, 1*time.Second)
        if err != nil {
            time.Sleep(backoff)
            continue
        }
        err = conn.Close()
        if err != nil {
            log.Fatal(err)
        }
        return
    }
    log.Fatalf("Server on port %s not up after 10 attempts", port)
}

Then in my TestMain() I do:

func TestMain(m *testing.M) {
    go startServer()
    waitForServer(serverPort)

    // run the suite
    os.Exit(m.Run())
}

socket.error: [Errno 10013] An attempt was made to access a socket in a way forbidden by its access permissions

On Windows Vista/7, with UAC, administrator accounts run programs in unprivileged mode by default.

Programs must prompt for administrator access before they run as administrator, with the ever-so-familiar UAC dialog. Since Python scripts aren't directly executable, there's no "Run as Administrator" context menu option.

It's possible to use ctypes.windll.shell32.IsUserAnAdmin() to detect whether the script has admin access, and ShellExecuteEx with the 'runas' verb on python.exe, with sys.argv[0] as a parameter to prompt the UAC dialog if needed.

Socket accept - Too many open files

There are multiple places where Linux can have limits on the number of file descriptors you are allowed to open.

You can check the following:

cat /proc/sys/fs/file-max

That will give you the system wide limits of file descriptors.

On the shell level, this will tell you your personal limit:

ulimit -n

This can be changed in /etc/security/limits.conf - it's the nofile param.

However, if you're closing your sockets correctly, you shouldn't receive this unless you're opening a lot of simulataneous connections. It sounds like something is preventing your sockets from being closed appropriately. I would verify that they are being handled properly.

Linux Tcp Connect with Select() Fails at Testserver