Python 'requests' library - define specific DNS?

requests uses urllib3, which ultimately uses httplib.HTTPConnection as well, so the techniques from https://stackoverflow.com/questions/4623090/python-set-custom-dns-server-for-urllib-requests (now deleted, it merely linked to Tell urllib2 to use custom DNS) still apply, to a certain extent.

The urllib3.connection module subclasses httplib.HTTPConnection under the same name, having replaced the .connect() method with one that calls self._new_conn. In turn, this delegates to urllib3.util.connection.create_connection(). It is perhaps easiest to patch that function:

from urllib3.util import connection

_orig_create_connection = connection.create_connection

def patched_create_connection(address, *args, **kwargs):
    """Wrap urllib3's create_connection to resolve the name elsewhere"""
    # resolve hostname to an ip address; use your own
    # resolver here, as otherwise the system resolver will be used.
    host, port = address
    hostname = your_dns_resolver(host)

    return _orig_create_connection((hostname, port), *args, **kwargs)

connection.create_connection = patched_create_connection

and you'd provide your own code to resolve the host portion of the address into an ip address instead of relying on the connection.create_connection() call (which wraps socket.create_connection()) to resolve the hostname for you.

Like all monkeypatching, be careful that the code hasn't significantly changed in later releases; the patch here was created against urllib3 version 1.21.1, but should work for versions as far back as 1.9.


Note that this answer was re-written to work with newer urllib3 releases, which have added a much more convenient patching location. See the edit history for the old method, applicable to version < 1.9, as a patch to the vendored urllib3 version rather than a stand-alone installation.
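The `your_dns_resolver` placeholder above is left to the reader. A minimal sketch, assuming a static override table (the hostname and IP below are purely illustrative), could look like this:

```python
from urllib3.util import connection

_orig_create_connection = connection.create_connection

# Hypothetical static resolver: map selected hostnames to fixed IPs,
# falling back to normal system resolution for everything else.
DNS_OVERRIDES = {"example.com": "93.184.216.34"}

def static_dns_resolver(host):
    """Return the override IP for known hosts, or the host unchanged."""
    return DNS_OVERRIDES.get(host, host)

def patched_create_connection(address, *args, **kwargs):
    host, port = address
    return _orig_create_connection((static_dns_resolver(host), port),
                                   *args, **kwargs)

connection.create_connection = patched_create_connection
```

Any real resolver (dnspython, a consul lookup, etc.) can be dropped in place of the dictionary; the only contract is "hostname in, IP string out".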

How do I specify URL resolution in python's requests library in a similar fashion to curl's --resolve flag?

After doing a bit of digging, I (unsurprisingly) found that Requests resolves hostnames by asking Python to do it (which is asking your operating system to do it). First I found some sample code to hijack DNS resolution (Tell urllib2 to use custom DNS) and then I figured out a few more details about how Python resolves hostnames in the socket documentation. Then it was just a matter of wiring everything together:

import socket
import requests

def is_ipv4(s):
    # Feel free to improve this: https://stackoverflow.com/questions/11827961/checking-for-ip-addresses
    return ':' not in s

dns_cache = {}

def add_custom_dns(domain, port, ip):
    key = (domain, port)
    # Strange parameters explained at:
    # https://docs.python.org/2/library/socket.html#socket.getaddrinfo
    # Values were taken from the output of `socket.getaddrinfo(...)`
    if is_ipv4(ip):
        value = (socket.AddressFamily.AF_INET, 0, 0, '', (ip, port))
    else:  # ipv6
        value = (socket.AddressFamily.AF_INET6, 0, 0, '', (ip, port, 0, 0))
    dns_cache[key] = [value]

# Inspired by: https://stackoverflow.com/a/15065711/868533
prv_getaddrinfo = socket.getaddrinfo
def new_getaddrinfo(*args):
    # Uncomment to see what calls to `getaddrinfo` look like.
    # print(args)
    try:
        return dns_cache[args[:2]]  # hostname and port
    except KeyError:
        return prv_getaddrinfo(*args)

socket.getaddrinfo = new_getaddrinfo

# Redirect example.com to the IP of test.domain.com (completely unrelated).
add_custom_dns('example.com', 80, '66.96.162.92')
res = requests.get('http://example.com')
print(res.text)  # Prints out the HTML of test.domain.com.

Some caveats I ran into while writing this:

  • This works poorly for https. The code works fine (just use https:// and 443 instead of http:// and 80). However, SSL certificates are tied to domain names, and Requests is going to try to validate the name on the certificate against the original domain you tried connecting to.
  • getaddrinfo returns slightly different info for IPv4 and IPv6 addresses. My implementation for is_ipv4 feels hacky to me and I strongly recommend a better version if you're using this in a real application.
  • The code has been tested on Python 3 but I see no reason why it wouldn't work as-is on Python 2.
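For the second caveat, a less hacky check could lean on the stdlib ipaddress module. This is a sketch; address_family is a new helper, not part of the answer above:

```python
import ipaddress
import socket

def address_family(ip):
    """Return the socket address family for a literal IP string.

    Raises ValueError if the string is not a valid IPv4/IPv6 address,
    which is stricter (and safer) than checking for a ':' character.
    """
    addr = ipaddress.ip_address(ip)
    return socket.AF_INET if addr.version == 4 else socket.AF_INET6
```

The returned family can then be used directly when building the fake getaddrinfo tuples.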

Set DNS timeout for HTTP requests using requests library

So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?

Separate things.

Use urllib.parse to extract the hostname from the URL, and then use dnspython to resolve that name, with whatever timeout you want.

Then, and only if the resolution was correct, fire up requests to grab the HTTP data.
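A sketch of that two-step approach, assuming dnspython >= 2.0 is installed (the function names here are illustrative, not from the answer):

```python
from urllib.parse import urlparse

import requests

def hostname_of(url):
    """Extract the host portion of a URL."""
    return urlparse(url).hostname

def fetch_with_dns_timeout(url, dns_timeout=1.0, http_timeout=5.0):
    """Resolve the hostname with its own short timeout, then fetch the URL."""
    import dns.resolver  # dnspython; imported lazily so hostname_of works without it
    res = dns.resolver.Resolver()
    res.lifetime = dns_timeout  # upper bound on total resolution time
    # Raises dns.exception.Timeout (or NXDOMAIN) before any HTTP work happens.
    res.resolve(hostname_of(url), 'A')
    return requests.get(url, timeout=http_timeout)
```

This keeps the DNS budget and the HTTP budget independent, which the requests timeout parameter alone cannot do.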

@blurfus: in requests you can only use the timeout parameter in the HTTP call, you can't attach it to a session. It is not spelled out explicitly in the documentation, but the code is quite clear on that.

There are many links that this program needs to check,

That is in fact a completely separate problem, and it exists even if all the links are fine; it is just a problem of volume.

The typical solutions fall into two categories:

  • use asynchronous libraries (they exist for both DNS and HTTP), where your calls are not blocking, you get the data later, so you are able to do something else
  • use multiprocessing or multithreading to parallelize things and have multiple URLs being tested at the same time by separate instances of your code.

They are not completely mutually exclusive, and you can find plenty of pros and cons for each. Asynchronous code may be more complicated to write and to understand later, so multiprocessing/multithreading is often the first step for a "quick win" (especially if you do not need to share anything between the processes/threads; otherwise it quickly becomes a problem), yet handling everything asynchronously makes the code scale more nicely with the volume.
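As a sketch of the multithreading route (names are illustrative), a ThreadPoolExecutor lets the blocking requests calls run side by side:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def check(url):
    """Return (url, status code) on success or (url, error string) on failure."""
    try:
        return url, requests.head(url, timeout=3).status_code
    except requests.RequestException as exc:
        return url, repr(exc)

def check_all(urls, workers=16):
    """Check many URLs concurrently; returns {url: status-or-error}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check, urls))
```

Dead links simply come back as error strings instead of stalling the whole run, and the worker count caps how many sockets are open at once.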

Using SRV DNS records with the python requests library

I ended up writing a patch for requests that would do this using this answer. I had to make some changes due to updates to the requests library. This patch works with requests version 2.11.1.

I used the dnspython library to resolve the SRV records, and it expects the IP address and port that consul is listening on for DNS requests to be available as the environment variable CONSUL_DNS_IP_PORT. To use the patch, import the requests_use_srv_records function from whatever module the patch is in and then call it. It will only attempt to use consul SRV records for hosts that end with .service.consul; other hosts will be resolved regularly.

Here's the patch:

# Python Imports
import os
from socket import error as SocketError, timeout as SocketTimeout

# 3rd Party Imports
from dns import resolver
from requests.packages.urllib3.connection import HTTPConnection
from requests.packages.urllib3.exceptions import (NewConnectionError,
                                                  ConnectTimeoutError)
from requests.packages.urllib3.util import connection

def resolve_srv_record(host):
    consul_dns_ip_port = os.environ.get('CONSUL_DNS_IP_PORT',
                                        '172.17.0.1:53')
    consul_dns_ip, consul_dns_port = consul_dns_ip_port.split(':')

    res = resolver.Resolver()
    res.port = int(consul_dns_port)  # dnspython expects an integer port
    res.nameservers = [consul_dns_ip]

    # Query via the configured resolver so the consul nameserver is used.
    ans = res.query(host, 'SRV')

    return ans.response.additional[0].items[0].address, ans[0].port

def patched_new_conn(self):
    if self.host.endswith('.service.consul'):
        hostname, port = resolve_srv_record(self.host)
    else:
        hostname = self.host
        port = self.port

    extra_kw = {}

    if self.source_address:
        extra_kw['source_address'] = self.source_address

    if self.socket_options:
        extra_kw['socket_options'] = self.socket_options

    try:
        conn = connection.create_connection((hostname, port),
                                            self.timeout,
                                            **extra_kw)

    except SocketTimeout:
        raise ConnectTimeoutError(
            self, "Connection to %s timed out. (connect timeout=%s)" %
            (self.host, self.timeout))

    except SocketError as e:
        raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e)

    return conn

def requests_use_srv_records():
    HTTPConnection._new_conn = patched_new_conn

How to make `requests` use a different hostname for TLS validation than DNS resolution?

Requests supports SSL Verification by default. However, it relies on the user making a request with the URL that has the hostname in it. If, however, the user needs to make a request with the IP address, they cannot actually verify a certificate against the hostname they want to request.

To accommodate this need, there is HostHeaderSSLAdapter. Example usage:

import requests
from requests_toolbelt.adapters import host_header_ssl
s = requests.Session()
s.mount('https://', host_header_ssl.HostHeaderSSLAdapter())
s.get("https://my-search-domain.example.com", headers={"Host": "vpc-elasti-xyz.us-east-1.es.amazonaws.com"})

https://toolbelt.readthedocs.io/en/latest/adapters.html#hostheaderssladapter


