Completely Random Identifier of a Given Length

Completely random identifier of a given length

Using this for an access token is a different story than UUIDs. You need not only pseudo-randomness but additionally this needs to be a cryptographically secure PRNG. If you don't really care what characters you use (they don't add anything to the security) you could use something as the following, producing a URL-safe Base64-encoded access token. URL-safeness becomes important in case you append the token to URLs, similar to what some Java web apps do: "http://www.bla.com/jsessionid=". If you would use raw Base64 strings for that purpose you would produce potentially invalid URLs.

require 'securerandom'

def produce_token(length=32)
token = SecureRandom.urlsafe_base64(length)
end

The probability of getting a duplicate is equal to 2^(-length). Since the output will be Base64-encoded, the actual output will be 4/3 * length long. If installed, this is based on the native OpenSSL PRNG implementation, so it should be pretty efficient in terms of performance. Should the OpenSSL extension not be installed, /dev/urandom will be used if available and finally, if you are on a Windows machine, CryptGenRandom would be used as fallback. Each of these options should be sufficiently performant. E.g., on my laptop running produce_tokena million times finishes in ~6s.

How to generate a unique identifier of a fixed length in Java?

Hmm... You could imitate a smaller GUID the following way. Let first 4 bytes of your string be the encoded current time - seconds passed after Unix. And the last 4 just a random combination. In this case the only way two ID's would coincide is that they were built at the same second. And the chances of that would be very veeery low because of the other 4 random characters.

Pseudocode:

get current time (4 byte integer
id[0] = 1st byte of current time (encoded to be a digit or a letter)
id[1] = 2nd
id[2] = 3rd
id[3] = 4th
id[4] = random character
id[5] = random character
id[6] = random character
id[7] = random character

generate unique string of length n without prefilled dictionary

I'm going to outline a solution for you that is going to resist casual inspection even by a knowledgeable person, though it probably IS NOT cryptographically secure.

First, your strings and numbers are in a one-to-one map. Here is some simple code for that.

alphabet = 'abcdefghijklmnopqrstuvwxyz'
len_of_codes = 7
char_to_pos = {}
for i in range(len(alphabet)):
char_to_pos[alphabet[i]] = i

def number_to_string(n):
chars = []
for _ in range(len_of_codes):
chars.append(alphabet[n % len(alphabet)])
n = n // len(alphabet)
return "".join(reversed(chars))

def string_to_number(s):
n = 0
for c in s:
n = len(alphabet) * n + char_to_pos[c]
return n

So now your problem is how to take an ascending stream of numbers and get an apparently random stream of numbers out of it instead. (Because you know how to turn those into strings.) Well, there are lots of tricks for primes, so let's find a decent sized prime that fits in the range that you want.

def is_prime (n):
for i in range(2, n):
if 0 == n%i:
return False
elif n < i*i:
return True
if n == 2:
return True
else:
return False

def last_prime_before (n):
for m in range(n-1, 1, -1):
if is_prime(m):
return m

print(last_prime_before(len(alphabet)**len_of_codes)

With this we find that we can use the prime 8031810103. That's how many numbers we'll be able to handle.

Now there is an easy way to scramble them. Which is to use the fact that multiplication modulo a prime scrambles the numbers in the range 1..(p-1).

def scramble1 (p, k, n):
return (n*k) % p

Picking a random number to scramble by, int(random.random() * 26**7) happened to give me 3661807866, we get a sequence we can calculate with:

for i in range(1, 5):
print(number_to_string(scramble1(8031810103, 3661807866, i)))

Which gives us

lwfdjoc
xskgtce
jopkctb
vkunmhd

This looks random to casual inspection. But will be reversible for any knowledgeable someone who puts modest effort in. They just have to guess the prime and algorithm that we used, look at 2 consecutive values to get the hidden parameter, then look at a couple of more to verify it.

Before addressing that, let's figure out how to take a string and get the number back. Thanks to Fermat's little theorem we know for p prime and 1 <= k < p that (k * k^(p-2)) % p == 1.

def n_pow_m_mod_k (n, m, k):
answer = 1
while 0 < m:
if 1 == m % 2:
answer = (answer * n) % k
m = m // 2
n = (n * n) % k
return answer

print(n_pow_m_mod_k(3661807866, 8031810103-2, 8031810103))

This gives us 3319920713. Armed with that we can calculate scramble1(8031810103, 3319920713, string_to_number("vkunmhd")) to find out that vkunmhd came from 4.

Now let's make it harder. Let's generate several keys to be scrambling with:

import random
p = 26**7
for i in range(5):
p = last_prime_before(p)
print((p, int(random.random() * p)))

When I ran this I happened to get:

(8031810103, 3661807866)
(8031810097, 3163265427)
(8031810091, 7069619503)
(8031809963, 6528177934)
(8031809917, 991731572)

Now let's scramble through several layers, working from smallest prime to largest (this requires reversing the sequence):

def encode (n):
for p, k in [
(8031809917, 991731572)
, (8031809963, 6528177934)
, (8031810091, 7069619503)
, (8031810097, 3163265427)
, (8031810103, 3661807866)
]:
n = scramble1(p, k, n)
return number_to_string(n)

This will give a sequence:

ehidzxf
shsifyl
gicmmcm
ofaroeg

And to reverse it just use the same trick that reversed the first scramble (reversing the primes so I am unscrambling in the order that I started with):

def decode (s):
n = string_to_number(s)
for p, k in [
(8031810103, 3319920713)
, (8031810097, 4707272543)
, (8031810091, 5077139687)
, (8031809963, 192273749)
, (8031809917, 5986071506)
]:
n = scramble1(p, k, n)

return n

TO BE CLEAR I do NOT promise that this is cryptographically secure. I'm not a cryptographer, and I'm aware enough of my limitations that I know not to trust it.

But I do promise that you'll have a sequence of over 8 billion strings that you are able to encode/decode with no obvious patterns.

Now take this code, scramble the alphabet, regenerate the magic numbers that I used, and choose a different number of layers to go through. I promise you that I personally have absolutely no idea how someone would even approach the problem of figuring out the algorithm. (But then again I'm not a cryptographer. Maybe they have some techniques to try. I sure don't.)

Unique string identifier with fixed length

As others pointed out, it is not possible to generate a unique alphanumeric string with length 11 from each email address, as the alphanumeric strings with length 11 are less than all the possible email addresses. So there will always be two different email addresses e1 and e2 such that F(e1) = F(e2) = S for some string S which you generate.

Still, given an email address e1, you can generate a string which is (let's call it) "pseudo-unique" i.e. a string S = F(e1) for which it is practically very hard to find another email address e2 != e1 such that F(e2) = S. And as others pointed out, you can use e.g. MD5 to achieve this.

Look at this class and call it on the byte array obtained from your email address.

http://docs.oracle.com/javase/7/docs/api/java/security/MessageDigest.html

Actually here is an example using MD5.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test003 {

public static void main(String[] args) throws Exception {
System.out.println(getPseudoUniqueString("test1@test.com"));
System.out.println(getPseudoUniqueString("test2@test.com"));
}

private static String getPseudoUniqueString(String str)
throws NoSuchAlgorithmException
{
MessageDigest md1 = MessageDigest.getInstance("MD5");
md1.update(str.getBytes());
byte[] bd1 = md1.digest();

StringBuffer hexString = new StringBuffer();
for (int i=0;i<bd1.length;i++) {
String hex=Integer.toHexString(0xff & bd1[i]);
if(hex.length()==1) {
hexString.append('0');
}
hexString.append(hex);
}

return hexString.toString().substring(0,11);
}
}

How to generate a random string of a fixed length in Go?

Paul's solution provides a simple, general solution.

The question asks for the "the fastest and simplest way". Let's address the fastest part too. We'll arrive at our final, fastest code in an iterative manner. Benchmarking each iteration can be found at the end of the answer.

All the solutions and the benchmarking code can be found on the Go Playground. The code on the Playground is a test file, not an executable. You have to save it into a file named XX_test.go and run it with

go test -bench . -benchmem

Foreword:

The fastest solution is not a go-to solution if you just need a random string. For that, Paul's solution is perfect. This is if performance does matter. Although the first 2 steps (Bytes and Remainder) might be an acceptable compromise: they do improve performance by like 50% (see exact numbers in the II. Benchmark section), and they don't increase complexity significantly.

Having said that, even if you don't need the fastest solution, reading through this answer might be adventurous and educational.

I. Improvements

1. Genesis (Runes)

As a reminder, the original, general solution we're improving is this:

func init() {
rand.Seed(time.Now().UnixNano())
}

var letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

func RandStringRunes(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letterRunes[rand.Intn(len(letterRunes))]
}
return string(b)
}

2. Bytes

If the characters to choose from and assemble the random string contains only the uppercase and lowercase letters of the English alphabet, we can work with bytes only because the English alphabet letters map to bytes 1-to-1 in the UTF-8 encoding (which is how Go stores strings).

So instead of:

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

we can use:

var letters = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

Or even better:

const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

Now this is already a big improvement: we could achieve it to be a const (there are string constants but there are no slice constants). As an extra gain, the expression len(letters) will also be a const! (The expression len(s) is constant if s is a string constant.)

And at what cost? Nothing at all. strings can be indexed which indexes its bytes, perfect, exactly what we want.

Our next destination looks like this:

const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

func RandStringBytes(n int) string {
b := make([]byte, n)
for i := range b {
b[i] = letterBytes[rand.Intn(len(letterBytes))]
}
return string(b)
}

3. Remainder

Previous solutions get a random number to designate a random letter by calling rand.Intn() which delegates to Rand.Intn() which delegates to Rand.Int31n().

This is much slower compared to rand.Int63() which produces a random number with 63 random bits.

So we could simply call rand.Int63() and use the remainder after dividing by len(letterBytes):

func RandStringBytesRmndr(n int) string {
b := make([]byte, n)
for i := range b {
b[i] = letterBytes[rand.Int63() % int64(len(letterBytes))]
}
return string(b)
}

This works and is significantly faster, the disadvantage is that the probability of all the letters will not be exactly the same (assuming rand.Int63() produces all 63-bit numbers with equal probability). Although the distortion is extremely small as the number of letters 52 is much-much smaller than 1<<63 - 1, so in practice this is perfectly fine.

To make this understand easier: let's say you want a random number in the range of 0..5. Using 3 random bits, this would produce the numbers 0..1 with double probability than from the range 2..5. Using 5 random bits, numbers in range 0..1 would occur with 6/32 probability and numbers in range 2..5 with 5/32 probability which is now closer to the desired. Increasing the number of bits makes this less significant, when reaching 63 bits, it is negligible.

4. Masking

Building on the previous solution, we can maintain the equal distribution of letters by using only as many of the lowest bits of the random number as many is required to represent the number of letters. So for example if we have 52 letters, it requires 6 bits to represent it: 52 = 110100b. So we will only use the lowest 6 bits of the number returned by rand.Int63(). And to maintain equal distribution of letters, we only "accept" the number if it falls in the range 0..len(letterBytes)-1. If the lowest bits are greater, we discard it and query a new random number.

Note that the chance of the lowest bits to be greater than or equal to len(letterBytes) is less than 0.5 in general (0.25 on average), which means that even if this would be the case, repeating this "rare" case decreases the chance of not finding a good number. After n repetition, the chance that we still don't have a good index is much less than pow(0.5, n), and this is just an upper estimation. In case of 52 letters the chance that the 6 lowest bits are not good is only (64-52)/64 = 0.19; which means for example that chances to not have a good number after 10 repetition is 1e-8.

So here is the solution:

const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
letterIdxBits = 6 // 6 bits to represent a letter index
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
)

func RandStringBytesMask(n int) string {
b := make([]byte, n)
for i := 0; i < n; {
if idx := int(rand.Int63() & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i++
}
}
return string(b)
}

5. Masking Improved

The previous solution only uses the lowest 6 bits of the 63 random bits returned by rand.Int63(). This is a waste as getting the random bits is the slowest part of our algorithm.

If we have 52 letters, that means 6 bits code a letter index. So 63 random bits can designate 63/6 = 10 different letter indices. Let's use all those 10:

const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
letterIdxBits = 6 // 6 bits to represent a letter index
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
letterIdxMax = 63 / letterIdxBits // # of letter indices fitting in 63 bits
)

func RandStringBytesMaskImpr(n int) string {
b := make([]byte, n)
// A rand.Int63() generates 63 random bits, enough for letterIdxMax letters!
for i, cache, remain := n-1, rand.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = rand.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}

return string(b)
}

6. Source

The Masking Improved is pretty good, not much we can improve on it. We could, but not worth the complexity.

Now let's find something else to improve. The source of random numbers.

There is a crypto/rand package which provides a Read(b []byte) function, so we could use that to get as many bytes with a single call as many we need. This wouldn't help in terms of performance as crypto/rand implements a cryptographically secure pseudorandom number generator so it's much slower.

So let's stick to the math/rand package. The rand.Rand uses a rand.Source as the source of random bits. rand.Source is an interface which specifies a Int63() int64 method: exactly and the only thing we needed and used in our latest solution.

So we don't really need a rand.Rand (either explicit or the global, shared one of the rand package), a rand.Source is perfectly enough for us:

var src = rand.NewSource(time.Now().UnixNano())

func RandStringBytesMaskImprSrc(n int) string {
b := make([]byte, n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}

return string(b)
}

Also note that this last solution doesn't require you to initialize (seed) the global Rand of the math/rand package as that is not used (and our rand.Source is properly initialized / seeded).

One more thing to note here: package doc of math/rand states:

The default Source is safe for concurrent use by multiple goroutines.

So the default source is slower than a Source that may be obtained by rand.NewSource(), because the default source has to provide safety under concurrent access / use, while rand.NewSource() does not offer this (and thus the Source returned by it is more likely to be faster).

7. Utilizing strings.Builder

All previous solutions return a string whose content is first built in a slice ([]rune in Genesis, and []byte in subsequent solutions), and then converted to string. This final conversion has to make a copy of the slice's content, because string values are immutable, and if the conversion would not make a copy, it could not be guaranteed that the string's content is not modified via its original slice. For details, see How to convert utf8 string to []byte? and golang: []byte(string) vs []byte(*string).

Go 1.10 introduced strings.Builder. strings.Builder is a new type we can use to build contents of a string similar to bytes.Buffer. Internally it uses a []byte to build the content, and when we're done, we can obtain the final string value using its Builder.String() method. But what's cool in it is that it does this without performing the copy we just talked about above. It dares to do so because the byte slice used to build the string's content is not exposed, so it is guaranteed that no one can modify it unintentionally or maliciously to alter the produced "immutable" string.

So our next idea is to not build the random string in a slice, but with the help of a strings.Builder, so once we're done, we can obtain and return the result without having to make a copy of it. This may help in terms of speed, and it will definitely help in terms of memory usage and allocations.

func RandStringBytesMaskImprSrcSB(n int) string {
sb := strings.Builder{}
sb.Grow(n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
sb.WriteByte(letterBytes[idx])
i--
}
cache >>= letterIdxBits
remain--
}

return sb.String()
}

Do note that after creating a new strings.Buidler, we called its Builder.Grow() method, making sure it allocates a big-enough internal slice (to avoid reallocations as we add the random letters).

8. "Mimicing" strings.Builder with package unsafe

strings.Builder builds the string in an internal []byte, the same as we did ourselves. So basically doing it via a strings.Builder has some overhead, the only thing we switched to strings.Builder for is to avoid the final copying of the slice.

strings.Builder avoids the final copy by using package unsafe:

// String returns the accumulated string.
func (b *Builder) String() string {
return *(*string)(unsafe.Pointer(&b.buf))
}

The thing is, we can also do this ourselves, too. So the idea here is to switch back to building the random string in a []byte, but when we're done, don't convert it to string to return, but do an unsafe conversion: obtain a string which points to our byte slice as the string data.

This is how it can be done:

func RandStringBytesMaskImprSrcUnsafe(n int) string {
b := make([]byte, n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}

return *(*string)(unsafe.Pointer(&b))
}

(9. Using rand.Read())

Go 1.7 added a rand.Read() function and a Rand.Read() method. We should be tempted to use these to read as many bytes as we need in one step, in order to achieve better performance.

There is one small "problem" with this: how many bytes do we need? We could say: as many as the number of output letters. We would think this is an upper estimation, as a letter index uses less than 8 bits (1 byte). But at this point we are already doing worse (as getting the random bits is the "hard part"), and we're getting more than needed.

Also note that to maintain equal distribution of all letter indices, there might be some "garbage" random data that we won't be able to use, so we would end up skipping some data, and thus end up short when we go through all the byte slice. We would need to further get more random bytes, "recursively". And now we're even losing the "single call to rand package" advantage...

We could "somewhat" optimize the usage of the random data we acquire from math.Rand(). We may estimate how many bytes (bits) we'll need. 1 letter requires letterIdxBits bits, and we need n letters, so we need n * letterIdxBits / 8.0 bytes rounding up. We can calculate the probability of a random index not being usable (see above), so we could request more that will "more likely" be enough (if it turns out it's not, we repeat the process). We can process the byte slice as a "bit stream" for example, for which we have a nice 3rd party lib: github.com/icza/bitio (disclosure: I'm the author).

But Benchmark code still shows we're not winning. Why is it so?

The answer to the last question is because rand.Read() uses a loop and keeps calling Source.Int63() until it fills the passed slice. Exactly what the RandStringBytesMaskImprSrc() solution does, without the intermediate buffer, and without the added complexity. That's why RandStringBytesMaskImprSrc() remains on the throne. Yes, RandStringBytesMaskImprSrc() uses an unsynchronized rand.Source unlike rand.Read(). But the reasoning still applies; and which is proven if we use Rand.Read() instead of rand.Read() (the former is also unsynchronzed).

II. Benchmark

All right, it's time for benchmarking the different solutions.

Moment of truth:

BenchmarkRunes-4                     2000000    723 ns/op   96 B/op   2 allocs/op
BenchmarkBytes-4 3000000 550 ns/op 32 B/op 2 allocs/op
BenchmarkBytesRmndr-4 3000000 438 ns/op 32 B/op 2 allocs/op
BenchmarkBytesMask-4 3000000 534 ns/op 32 B/op 2 allocs/op
BenchmarkBytesMaskImpr-4 10000000 176 ns/op 32 B/op 2 allocs/op
BenchmarkBytesMaskImprSrc-4 10000000 139 ns/op 32 B/op 2 allocs/op
BenchmarkBytesMaskImprSrcSB-4 10000000 134 ns/op 16 B/op 1 allocs/op
BenchmarkBytesMaskImprSrcUnsafe-4 10000000 115 ns/op 16 B/op 1 allocs/op

Just by switching from runes to bytes, we immediately have 24% performance gain, and memory requirement drops to one third.

Getting rid of rand.Intn() and using rand.Int63() instead gives another 20% boost.

Masking (and repeating in case of big indices) slows down a little (due to repetition calls): -22%...

But when we make use of all (or most) of the 63 random bits (10 indices from one rand.Int63() call): that speeds up big time: 3 times.

If we settle with a (non-default, new) rand.Source instead of rand.Rand, we again gain 21%.

If we utilize strings.Builder, we gain a tiny 3.5% in speed, but we also achieved 50% reduction in memory usage and allocations! That's nice!

Finally if we dare to use package unsafe instead of strings.Builder, we again gain a nice 14%.

Comparing the final to the initial solution: RandStringBytesMaskImprSrcUnsafe() is 6.3 times faster than RandStringRunes(), uses one sixth memory and half as few allocations. Mission accomplished.

Random string generation with upper case letters and digits

Answer in one line:

''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

or even shorter starting with Python 3.6 using random.choices():

''.join(random.choices(string.ascii_uppercase + string.digits, k=N))

A cryptographically more secure version: see this post

''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(N))

In details, with a clean function for further reuse:

>>> import string
>>> import random
>>> def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
... return ''.join(random.choice(chars) for _ in range(size))
...
>>> id_generator()
'G5G74W'
>>> id_generator(3, "6793YUIO")
'Y3U'

How does it work ?

We import string, a module that contains sequences of common ASCII characters, and random, a module that deals with random generation.

string.ascii_uppercase + string.digits just concatenates the list of characters representing uppercase ASCII chars and digits:

>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.ascii_uppercase + string.digits
'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

Then we use a list comprehension to create a list of 'n' elements:

>>> range(4) # range create a list of 'n' numbers
[0, 1, 2, 3]
>>> ['elem' for _ in range(4)] # we use range to create 4 times 'elem'
['elem', 'elem', 'elem', 'elem']

In the example above, we use [ to create the list, but we don't in the id_generator function so Python doesn't create the list in memory, but generates the elements on the fly, one by one (more about this here).

Instead of asking to create 'n' times the string elem, we will ask Python to create 'n' times a random character, picked from a sequence of characters:

>>> random.choice("abcde")
'a'
>>> random.choice("abcde")
'd'
>>> random.choice("abcde")
'b'

Therefore random.choice(chars) for _ in range(size) really is creating a sequence of size characters. Characters that are randomly picked from chars



Related Topics



Leave a reply



Submit