How to Convert an Integer to the Shortest Url-Safe String in Python

How to convert an integer to the shortest url-safe string in Python?

This answer is similar in spirit to Douglas Leeder's, with the following changes:

  • It doesn't use actual Base64, so there's no padding characters
  • Instead of converting the number first to a byte-string (base 256), it converts it directly to base 64, which has the advantage of letting you represent negative numbers using a sign character.

    import string
    ALPHABET = string.ascii_uppercase + string.ascii_lowercase + \
    string.digits + '-_'
    ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(ALPHABET))
    BASE = len(ALPHABET)
    SIGN_CHARACTER = '$'

    def num_encode(n):
    if n < 0:
    return SIGN_CHARACTER + num_encode(-n)
    s = []
    while True:
    n, r = divmod(n, BASE)
    s.append(ALPHABET[r])
    if n == 0: break
    return ''.join(reversed(s))

    def num_decode(s):
    if s[0] == SIGN_CHARACTER:
    return -num_decode(s[1:])
    n = 0
    for c in s:
    n = n * BASE + ALPHABET_REVERSE[c]
    return n

    >>> num_encode(0)
'A'
>>> num_encode(64)
'BA'
>>> num_encode(-(64**5-1))
'$_____'

A few side notes:

  • You could (marginally) increase the human-readibility of the base-64 numbers by putting string.digits first in the alphabet (and making the sign character '-'); I chose the order that I did based on Python's urlsafe_b64encode.
  • If you're encoding a lot of negative numbers, you could increase the efficiency by using a sign bit or one's/two's complement instead of a sign character.
  • You should be able to easily adapt this code to different bases by changing the alphabet, either to restrict it to only alphanumeric characters or to add additional "URL-safe" characters.
  • I would recommend against using a representation other than base 10 in URIs in most cases—it adds complexity and makes debugging harder without significant savings compared to the overhead of HTTP—unless you're going for something TinyURL-esque.

How to convert an integer to a string in any base?

If you need compatibility with ancient versions of Python, you can either use gmpy (which does include a fast, completely general int-to-string conversion function, and can be built for such ancient versions – you may need to try older releases since the recent ones have not been tested for venerable Python and GMP releases, only somewhat recent ones), or, for less speed but more convenience, use Python code – e.g., for Python 2, most simply:

import string
digs = string.digits + string.ascii_letters

def int2base(x, base):
if x < 0:
sign = -1
elif x == 0:
return digs[0]
else:
sign = 1

x *= sign
digits = []

while x:
digits.append(digs[int(x % base)])
x = int(x / base)

if sign < 0:
digits.append('-')

digits.reverse()

return ''.join(digits)

For Python 3, int(x / base) leads to incorrect results, and must be changed to x // base:

import string
digs = string.digits + string.ascii_letters

def int2base(x, base):
if x < 0:
sign = -1
elif x == 0:
return digs[0]
else:
sign = 1

x *= sign
digits = []

while x:
digits.append(digs[x % base])
x = x // base

if sign < 0:
digits.append('-')

digits.reverse()

return ''.join(digits)

Converting integer to a 4 letter encoded string

In general, there are many options to do this. But in most cases there is no direct relation between integer ID and generated slug for short URL. I am pretty sure this case is not exception. I suppose they just generate random 4 letter string that never used before and maintain relation between film and short URL at the database level.

Encoding a numeric string into a shortened alphanumeric string, and back again

This is a pretty good compression:

import base64

def num_to_alpha(num):
num = hex(num)[2:].rstrip("L")

if len(num) % 2:
num = "0" + num

return base64.b64encode(num.decode('hex'))

It first turns the integer into a bytestring and then base64 encodes it. Here's the decoder:

def alpha_to_num(alpha):
num_bytes = base64.b64decode(alpha)
return int(num_bytes.encode('hex'), 16)

Example:

>>> num_to_alpha(20120425161608678259146181504021022591461815040210220120425161608667)
'vw4LUVm4Ea3fMnoTkHzNOlP6Z7eUAkHNdZjN2w=='
>>> alpha_to_num('vw4LUVm4Ea3fMnoTkHzNOlP6Z7eUAkHNdZjN2w==')
20120425161608678259146181504021022591461815040210220120425161608667

Short way to convert string to int

my_input = int(my_input)

There is no shorter way than using the int function (as you mention)

How to convert an integer to a string in any base?

If you need compatibility with ancient versions of Python, you can either use gmpy (which does include a fast, completely general int-to-string conversion function, and can be built for such ancient versions – you may need to try older releases since the recent ones have not been tested for venerable Python and GMP releases, only somewhat recent ones), or, for less speed but more convenience, use Python code – e.g., for Python 2, most simply:

import string
digs = string.digits + string.ascii_letters

def int2base(x, base):
if x < 0:
sign = -1
elif x == 0:
return digs[0]
else:
sign = 1

x *= sign
digits = []

while x:
digits.append(digs[int(x % base)])
x = int(x / base)

if sign < 0:
digits.append('-')

digits.reverse()

return ''.join(digits)

For Python 3, int(x / base) leads to incorrect results, and must be changed to x // base:

import string
digs = string.digits + string.ascii_letters

def int2base(x, base):
if x < 0:
sign = -1
elif x == 0:
return digs[0]
else:
sign = 1

x *= sign
digits = []

while x:
digits.append(digs[x % base])
x = x // base

if sign < 0:
digits.append('-')

digits.reverse()

return ''.join(digits)

Shortest possible encoded string with a decode possibility (shorten URL) using only PHP

I suspect that you will need to think more about your method of hashing if you don't want it to be decodable by the user. The issue with Base64 is that a Base64 string looks like a base64 string. There's a good chance that someone that's savvy enough to be looking at your page source will probably recognise it too.

Part one:

a method that encodes an string to shortest possible length

If you're flexible on your URL vocabulary/characters, this will be a good starting place. Since gzip makes a lot of its gains using back references, there is little point as the string is so short.

Consider your example - you've only saved 2 bytes in the compression, which are lost again in Base64 padding:

Non-gzipped: string(52) "aW1nPS9kaXIvZGlyL2hpLXJlcy1pbWcuanBnJnc9NzAwJmg9NTAw"

Gzipped: string(52) "y8xNt9VPySwC44xM3aLUYt3M3HS9rIJ0tXJbcwMDtQxbUwMDAA=="

If you reduce your vocabulary size, this will naturally allow you better compression. Let's say we remove some redundant information.

Take a look at the functions:

function compress($input, $ascii_offset = 38){
$input = strtoupper($input);
$output = '';
//We can try for a 4:3 (8:6) compression (roughly), 24 bits for 4 characters
foreach(str_split($input, 4) as $chunk) {
$chunk = str_pad($chunk, 4, '=');

$int_24 = 0;
for($i=0; $i<4; $i++){
//Shift the output to the left 6 bits
$int_24 <<= 6;

//Add the next 6 bits
//Discard the leading ASCII chars, i.e make
$int_24 |= (ord($chunk[$i]) - $ascii_offset) & 0b111111;
}

//Here we take the 4 sets of 6 apart in 3 sets of 8
for($i=0; $i<3; $i++) {
$output = pack('C', $int_24) . $output;
$int_24 >>= 8;
}
}

return $output;
}

And

function decompress($input, $ascii_offset = 38) {

$output = '';
foreach(str_split($input, 3) as $chunk) {

//Reassemble the 24 bit ints from 3 bytes
$int_24 = 0;
foreach(unpack('C*', $chunk) as $char) {
$int_24 <<= 8;
$int_24 |= $char & 0b11111111;
}

//Expand the 24 bits to 4 sets of 6, and take their character values
for($i = 0; $i < 4; $i++) {
$output = chr($ascii_offset + ($int_24 & 0b111111)) . $output;
$int_24 >>= 6;
}
}

//Make lowercase again and trim off the padding.
return strtolower(rtrim($output, '='));
}

It is basically a removal of redundant information, followed by the compression of 4 bytes into 3. This is achieved by effectively having a 6-bit subset of the ASCII table. This window is moved so that the offset starts at useful characters and includes all the characters you're currently using.

With the offset I've used, you can use anything from ASCII 38 to 102. This gives you a resulting string of 30 bytes, that's a 9-byte (24%) compression! Unfortunately, you'll need to make it URL-safe (probably with base64), which brings it back up to 40 bytes.

I think at this point, you're pretty safe to assume that you've reached the "security through obscurity" level required to stop 99.9% of people. Let's continue though, to the second part of your question

so the user can't guess how to get the larger image

It's arguable that this is already solved with the above, but you need to pass this through a secret on the server, preferably with PHP's OpenSSL interface. The following code shows the complete usage flow of functions above and the encryption:

$method = 'AES-256-CBC';
$secret = base64_decode('tvFD4Vl6Pu2CmqdKYOhIkEQ8ZO4XA4D8CLowBpLSCvA=');
$iv = base64_decode('AVoIW0Zs2YY2zFm5fazLfg==');

$input = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
var_dump($input);

$compressed = compress($input);
var_dump($compressed);

$encrypted = openssl_encrypt($compressed, $method, $secret, false, $iv);
var_dump($encrypted);

$decrypted = openssl_decrypt($encrypted, $method, $secret, false, $iv);
var_dump($decrypted);

$decompressed = decompress($compressed);
var_dump($decompressed);

The output of this script is the following:

string(39) "img=/dir/dir/hi-res-img.jpg&w=700&h=500"
string(30) "<��(��tJ��@�xH��G&(�%��%��xW"
string(44) "xozYGselci9i70cTdmpvWkrYvGN9AmA7djc5eOcFoAM="
string(30) "<��(��tJ��@�xH��G&(�%��%��xW"
string(39) "img=/dir/dir/hi-res-img.jpg&w=700&h=500"

You'll see the whole cycle: compression → encryption → Base64 encode/decode → decryption → decompression. The output of this would be as close as possible as you could really get, at near the shortest length you could get.

Everything aside, I feel obliged to conclude this with the fact that it is theoretical only, and this was a nice challenge to think about. There are definitely better ways to achieve your desired result - I'll be the first to admit that my solution is a little bit absurd!



Related Topics



Leave a reply



Submit