Python Urllib2 with Keep Alive

Use the urlgrabber library. This includes an HTTP handler for urllib2 that supports HTTP 1.1 and keepalive:

>>> import urllib2
>>> from urlgrabber.keepalive import HTTPHandler
>>> keepalive_handler = HTTPHandler()
>>> opener = urllib2.build_opener(keepalive_handler)
>>> urllib2.install_opener(opener)
>>>
>>> fo = urllib2.urlopen('http://www.python.org')

Note: you should use urlgrabber version 3.9.0 or earlier, as the keepalive module was removed in version 3.9.1.

There is a port of the keepalive module to Python 3.

What is the best way to use HTTP Keep-Alive in Python 2.7

I would suggest using the requests library. It has Keep-Alive support in addition to many other features.
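
For illustration, a minimal sketch (assuming the requests package is installed; the URLs are placeholders): a Session object reuses the underlying TCP connection across requests through its connection pool.

import requests

# a Session keeps the underlying TCP connection alive via its connection pool
session = requests.Session()

# both requests to the same host should reuse the same pooled connection
r1 = session.get('http://www.python.org')
r2 = session.get('http://www.python.org/about/')
print(r1.status_code)
print(r2.status_code)

session.close()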

python keep alive response object

@ScottHunter, in your solution we are still reading the lines that were already read; the only difference is that we read them and skip the ones that have already been processed.

So the solution I implemented is to read a limited number of characters at a time, using readline with a size limit:

from urllib2 import urlopen

response = urlopen(url)
while True:
    line = response.readline(4096)
    if not line:
        break
    do_some_job(line)
response.close()

Persistence of urllib.request connections to an HTTP server

urllib.request doesn't support persistent connections. There is 'Connection: close' hardcoded in the code. But http.client partially supports persistent connections (including legacy http/1.0 keep-alive). So the question title might be misleading.
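
As a rough illustration of that partial support (the host and paths are placeholders, not from the original answer), two requests can share one http.client connection:

import http.client

# one TCP connection, reused for two requests (HTTP/1.1 connections are persistent by default)
conn = http.client.HTTPConnection('www.python.org')

conn.request('GET', '/')
r1 = conn.getresponse()
r1.read()  # the response must be read fully before the connection can be reused

conn.request('GET', '/about/')
r2 = conn.getresponse()
r2.read()

conn.close()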


I want to do some performance testing on one of our web servers, to see how the server handles a lot of persistent connections. Unfortunately, I'm not terribly familiar with HTTP and web testing.

You could use existing HTTP testing tools such as slowloris or httperf instead of writing one yourself.


How do I keep these connections alive?

To close an HTTP/1.1 connection, a client should explicitly send a Connection: close header; otherwise the connection is considered persistent by the server (though the server may close it at any moment, and http.client won't know about it until it tries to read from or write to the connection).
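
For example (a sketch with a placeholder host, not part of the original answer), that header can be sent explicitly through http.client:

import http.client

conn = http.client.HTTPConnection('www.example.com')
# explicitly ask the server to close the connection after this response
conn.request('GET', '/', headers={'Connection': 'close'})
response = conn.getresponse()
body = response.read()
conn.close()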

conn.connect() returns almost immediately and your thread ends. To force each thread to maintain an HTTP connection to the server, you could do something like this:

import time
import http.client  # needed for HTTPConnection below

def make_http_connection(*args, **kwargs):
    while True:  # make new http connections
        h = http.client.HTTPConnection(*args, **kwargs)
        while True:  # make multiple requests using a single connection
            try:
                h.request('GET', '/')  # send request; make conn. on the first run
                response = h.getresponse()
                while True:  # read response slooowly
                    b = response.read(1)  # read 1 byte
                    if not b:
                        break
                    time.sleep(60)  # wait a minute before reading next byte
                    # note: the whole minute might pass before we notice that
                    # the server has closed the connection already
            except Exception:
                break  # make new connection on any error

Note: if the server returns 'Connection: close' then there is a single request per connection.
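
One way to detect that case (a sketch, assuming h and response are the connection and response objects from the code above) is to check the response header before reusing the connection:

# after the response has been read completely
if response.getheader('Connection', '').lower() == 'close':
    h.close()  # the server will not keep this connection open; make a new one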


(Also, on an unrelated note, is there a better procedure for waiting for a keyboard interrupt than the ugly while True: block at the end of my code?)

To wait until all threads finish or a KeyboardInterrupt happens, you could:

while threads:
    try:
        for t in threads[:]:  # iterate over a copy of the thread list
            t.join(.1)  # timeout 0.1 seconds
            if not t.is_alive():
                threads.remove(t)
    except KeyboardInterrupt:
        break

Or something like this:

while threading.active_count() > 1:
    try:
        main_thread = threading.current_thread()
        for t in threading.enumerate():  # enumerate all alive threads
            if t is not main_thread:
                t.join(.1)
    except KeyboardInterrupt:
        break

The latter might not work for various reasons, e.g., if there are dummy threads, such as threads started in C extensions without using the threading module.

concurrent.futures.ThreadPoolExecutor provides a higher level of abstraction than the threading module and can hide some of this complexity.
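
A minimal Python 3 sketch of that idea (fetch, the host list, and max_workers are placeholders): each worker task keeps one persistent connection and issues several requests over it.

import concurrent.futures
import http.client

def fetch(host):
    # one persistent connection per task; both requests reuse it
    conn = http.client.HTTPConnection(host)
    conn.request('GET', '/')
    conn.getresponse().read()
    conn.request('GET', '/')
    body = conn.getresponse().read()
    conn.close()
    return len(body)

hosts = ['www.python.org'] * 10
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for size in executor.map(fetch, hosts):
        print(size)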

Instead of a thread-per-connection model, you could open multiple connections concurrently in a single thread, e.g., using requests.async or gevent directly.
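
For instance, a rough gevent sketch (the URLs and the fetch function are placeholders; monkey-patching makes the blocking urllib2 calls cooperative so one thread can drive many connections):

import gevent
from gevent import monkey
monkey.patch_all()  # patch the socket module so blocking I/O yields to other greenlets

import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

urls = ['http://www.python.org'] * 10
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=30)
print([len(job.value) for job in jobs if job.value is not None])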

How to Speed Up Python's urllib2 when doing multiple requests

If you switch to httplib, you will have finer control over the underlying connection.

For example:

import httplib

conn = httplib.HTTPConnection(host)  # takes a host name (and optional port), not a full URL

conn.request('GET', '/foo')
r1 = conn.getresponse()
r1.read()

conn.request('GET', '/bar')
r2 = conn.getresponse()
r2.read()

conn.close()

This would send 2 HTTP GETs on the same underlying TCP connection.

Why aren't persistent connections supported by URLLib2?

It's a well-known limitation of urllib2 (and urllib as well). IMHO the best attempt so far to fix it and make it right is Garry Bodsworth's coda_network for Python 2.6 or 2.7: replacement, patched versions of urllib2 (and some other modules) that support keep-alive (and a bunch of other smaller but quite welcome fixes).

Python urllib2 - Freezes when connection temporarily dies

According to the docs, the default timeout is, indeed, no timeout. You can specify a timeout when calling urlopen though. :)

page = urllib2.urlopen(req, timeout=30)
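
And if you want to handle the timeout rather than let it propagate, a sketch (req is the request object from the question; the messages are just placeholders):

import socket
import urllib2

try:
    page = urllib2.urlopen(req, timeout=30)
    data = page.read()
except urllib2.URLError as e:
    # a timeout while connecting typically surfaces as URLError(socket.timeout)
    print('request failed: %s' % e)
except socket.timeout:
    # a timeout while reading the body can raise socket.timeout directly
    print('read timed out')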

