How to Generate a GUID in R

How can I generate a GUID in R?

The best choice for this now is the uuid package. It consists of one function (UUIDgenerate) that doesn't rely on R's internal random number generators, so, unlike the approach in @thelatemail's answer, it is unaffected by calls to set.seed in a session. You can choose to have the UUID generated either by the package's internal random number generator or based on time.
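
As a minimal sketch of the above, assuming the uuid package is installed from CRAN (the variable names are illustrative only), the following generates UUIDs both ways and checks that re-seeding R's generator does not reproduce the same value:

```r
# Minimal sketch; assumes the uuid package is installed (install.packages("uuid")).
library(uuid)

set.seed(42)
id_a <- UUIDgenerate()   # package default
set.seed(42)
id_b <- UUIDgenerate()   # same seed, yet a different UUID is expected

id_random <- UUIDgenerate(use.time = FALSE)  # random-based
id_time   <- UUIDgenerate(use.time = TRUE)   # time-based

print(c(id_a, id_b, id_random, id_time))
```

Each value is a 36-character string in the usual 8-4-4-4-12 layout.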

A faster way to generate a vector of UUIDs in R

Bottom line up front: no, there is currently no way to speed up generating a large number of UUIDs with uuid without compromising the core premise of uniqueness.

In fact, your suggestion to use use.time=FALSE has serious ramifications on Windows. See below.

It is possible to get faster performance at scale, just not with uuid. See below.

uuid on Windows

Any assessment of uuid::UUIDgenerate's performance should take the OS into account, and more specifically its source of randomness. Performance does matter, so start with a benchmark:

library(microbenchmark)
microbenchmark(
  rf = replicate(1000, uuid::UUIDgenerate(FALSE)),
  rt = replicate(1000, uuid::UUIDgenerate(TRUE)),
  sf = sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
  st = sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
#  expr       min        lq     mean   median       uq      max neval
#    rf  8.675561  9.330877 11.73299 10.14592 11.75467  66.2435   100
#    rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411   100
#    sf  8.570900  9.270524 11.28199 10.22779 12.06993  24.3583   100
#    st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713   100

... so on Windows, use.time=FALSE is roughly nine times faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is no slower. Use replicate here unless you need the index argument that sapply passes for some reason.)

However, there is a problem:

R.version[1:3]
#          _
# platform x86_64-w64-mingw32
# arch     x86_64
# os       mingw32
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20

Given that a UUID is intended to be unique on every call, this is disturbing: only 20 distinct values came out of 1000 calls, a symptom of insufficient randomness on Windows. (Does WSL provide a way out of this? Another research opportunity ...)
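
Given the result above, it may be worth checking any generated batch before using it as keys. A minimal sketch in base R follows; assert_unique is a hypothetical helper name, not part of uuid:

```r
# Hypothetical helper (not part of uuid): fail loudly if a batch of
# identifiers contains repeats, instead of silently using them as keys.
assert_unique <- function(ids) {
  n_dup <- sum(duplicated(ids))
  if (n_dup > 0L) {
    stop(sprintf("%d of %d identifiers are duplicates", n_dup, length(ids)))
  }
  invisible(ids)
}

ids <- assert_unique(c("a1", "b2", "c3"))                      # passes, returns its input
res <- try(assert_unique(c("a1", "a1", "b2")), silent = TRUE)  # raises an error
inherits(res, "try-error")
# [1] TRUE
```

In practice the generation call would be wrapped, e.g. assert_unique(replicate(1000, uuid::UUIDgenerate(FALSE))).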

uuid on Linux

For comparison, the same tests on a non-Windows platform:

microbenchmark(
  rf = replicate(1000, uuid::UUIDgenerate(FALSE)),
  rt = replicate(1000, uuid::UUIDgenerate(TRUE)),
  sf = sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
  st = sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
#  expr       min       lq     mean   median       uq       max neval
#    rf 20.852227 21.48981 24.90932 22.30334 25.11449  74.20972   100
#    rt  9.782106 11.03714 14.15256 12.04848 15.41695 100.83724   100
#    sf 20.250873 21.39140 24.67585 22.44717 27.51227  44.43504   100
#    st  9.852275 11.15936 13.34731 12.11374 15.03694  27.79595   100

R.version[1:3]
#          _
# platform x86_64-pc-linux-gnu
# arch     x86_64
# os       linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000

(I'm slightly intrigued by the fact that use.time=FALSE on Linux takes twice as long as on Windows ...)

UUID generation with a SQL server

If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.

(Side note: there are "V4" (completely random), "V1" (time-based, and includes the system's MAC address), and "V1mc" (time-based, but with a random multicast MAC address in place of the system's real one) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's MAC address.)
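
The version is recorded in the UUID itself, as the first hex digit of the third group, so it can be read back from the string. A small base-R sketch (uuid_version is a hypothetical helper name; the first sample value is taken from the PostgreSQL output in this answer, the second is a made-up V4-shaped string):

```r
# Hypothetical helper: return the version number encoded in character 15
# of a canonical 8-4-4-4-12 UUID string.
uuid_version <- function(x) strtoi(substr(x, 15L, 15L), base = 16L)

uuid_version("53cd17c6-3c21-11e8-b2bf-7bab2a3c8486")
# [1] 1
uuid_version("ee6b08da-2991-4f82-95dd-78fea48abf43")
# [1] 4
```

Note that V1 and V1mc both report version 1; they differ only in the node (last) group.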

Some performance comparisons on Windows (all times in seconds):

#         n  uuid postgres sqlite sqlserver
# 1     100  0.00     1.23   1.13      0.84
# 2    1000  0.05     1.13   1.21      1.08
# 3   10000  0.47     1.35   1.45      1.17
# 4  100000  5.39     3.10   3.50      2.68
# 5 1000000 63.48    16.61  17.47     16.31

Going through SQL carries a fixed overhead of roughly a second, but in these timings it pays for itself somewhere between 10,000 and 100,000 UUIDs.
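
A comparison like the one above can be produced with a small harness along these lines. This is only a sketch, not the code used for the table: time_generators and fake_ids are hypothetical names, and fake_ids is a stand-in that does not produce real UUIDs, so that the example runs without a database or extra packages.

```r
# Hypothetical harness: `generators` is a named list of functions that take n
# and return n identifiers; elapsed seconds are recorded for each size.
time_generators <- function(generators, sizes) {
  out <- data.frame(n = sizes)
  for (nm in names(generators)) {
    out[[nm]] <- vapply(sizes, function(n) {
      system.time(generators[[nm]](n))[["elapsed"]]
    }, numeric(1))
  }
  out
}

# Stand-in generator (NOT a real UUID source), used only to exercise the harness.
fake_ids <- function(n) sprintf("id-%08d", seq_len(n))

timings <- time_generators(list(fake = fake_ids), c(100, 1000, 10000))
print(timings)
```

To compare real sources, each list entry would wrap one approach, for example function(n) replicate(n, uuid::UUIDgenerate(TRUE)) or a function that runs one of the SQL queries below.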

  • PostgreSQL needs the uuid-ossp extension, installable with

    CREATE EXTENSION "uuid-ossp"

    Once installed/available, you can generate n UUIDs with:

    n <- 3
    pgcon <- DBI::dbConnect(...)
    DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
    # uuid
    # 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
    # 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
    # 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f

    Other UUID functions exist; see https://www.postgresql.org/docs/9.6/static/uuid-ossp.html

  • SQLite has only limited support for this, but the following hack works well enough to produce n V4-style UUIDs (random, though it does not set the version and variant digits that RFC 4122 prescribes):

    sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
    DBI::dbGetQuery(sqlitecon, sprintf("
      WITH RECURSIVE cnt(x) AS (
        SELECT 1 UNION ALL SELECT x+1 FROM cnt LIMIT %d
      )
      SELECT (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) AS uuid
      FROM cnt", n))
    # uuid
    # 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
    # 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
    # 3 EAC159D6-7986-F42C-C5F5-35764544C105

    Matching the usual UUID format takes a little extra effort and is a nicety at best. You might find small performance improvements by not clinging to this format.

  • SQL Server requires temporarily creating a table (with newsequentialid()), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (No code offered, it doesn't add much.)
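
The strings produced by the SQLite query are random but carry arbitrary digits where RFC 4122 expects the version and variant. If strict V4 formatting matters, one option is to patch those two digits in R afterwards. A minimal base-R sketch, where as_v4 is a hypothetical helper name and the input is the first row of the SQLite output above:

```r
# Hypothetical helper: force the version and variant fields of a random
# 8-4-4-4-12 hex string so that it reads as an RFC 4122 version-4 UUID.
as_v4 <- function(x) {
  x <- tolower(x)
  substr(x, 15, 15) <- "4"                      # version digit
  variant <- strtoi(substr(x, 20, 20), 16L)     # current digit, 0..15
  # keep the two low bits, set the two high bits to binary 10 (8, 9, a or b)
  substr(x, 20, 20) <- c("8", "9", "a", "b")[variant %% 4L + 1L]
  x
}

as_v4("EE6B08DA-2991-BF82-55DD-78FEA48ABF43")
# [1] "ee6b08da-2991-4f82-95dd-78fea48abf43"
```

The function is vectorised, so it can be applied directly to the uuid column returned by DBI::dbGetQuery. It overwrites six of the 128 random bits, which is what a V4 UUID does anyway.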

Other considerations

In addition to execution time and sufficient randomness, there are various discussions (uncited for now) about database tables that report performance impacts from using non-consecutive UUIDs as keys. This has to do with index pages and such, and is outside the scope of this answer.

However, assuming this is true ... with the assumption that rows inserted at around the same time (temporally correlated) are often grouped together (directly or sub-grouped), then it is a good thing to keep same-day data with UUID keys in the same db index-page, so V4 (completely random) UUIDs may decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.

Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.

(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having PostgreSQL nearby. Fortunately, I use PostgreSQL enough for other things that I keep an instance around with Docker on Windows.)

How do I create a GUID / UUID?

UUIDs (Universally Unique IDentifier), also known as GUIDs (Globally Unique IDentifier), according to RFC 4122, are identifiers designed to provide certain uniqueness guarantees.

While it is possible to implement RFC-compliant UUIDs in a few lines of JavaScript code (e.g., see @broofa's answer), there are several common pitfalls:

  • Invalid id format (UUIDs must be of the form "xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx", where x is one of [0-9, a-f], M is one of [1-5], and N is one of [8, 9, a, b])
  • Use of a low-quality source of randomness (such as Math.random)
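
The format rule in the first pitfall translates directly into a regular expression. A minimal sketch in R, to stay with the language used in the rest of this page; is_rfc4122_uuid is a hypothetical helper name, and accepting upper-case hex digits is an assumption on top of the rule as stated:

```r
# Hypothetical helper: TRUE where a string matches
# xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx with M in 1-5 and N in 8, 9, a, b.
is_rfc4122_uuid <- function(x) {
  pattern <- "^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
  grepl(pattern, x, ignore.case = TRUE)  # assumption: upper-case hex is accepted
}

is_rfc4122_uuid(c(
  "53cd17c6-3c21-11e8-b2bf-7bab2a3c8486",  # M = 1, N = b: valid
  "EE6B08DA-2991-BF82-55DD-78FEA48ABF43",  # M = B, N = 5: not valid
  "not-a-uuid"
))
# [1]  TRUE FALSE FALSE
```

A check like this only covers the format; it says nothing about the quality of the randomness behind the value.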

Thus, developers writing code for production environments are encouraged to use a rigorous, well-maintained implementation such as the uuid module.

How to specify a guid as row.names in R?

Read the data as in your first attempt (without row.names=F), then use the id column as the row names and drop it:

users_for_dashboard_view <- read.delim("Neues Textdokument.txt", na.strings="NULL")
rownames(users_for_dashboard_view) <- users_for_dashboard_view$id
users_for_dashboard_view$id <- NULL
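
The same steps on a small, made-up data frame, so they can be tried without the original file. The column names and values here are illustrative assumptions; the only requirement is that the id values are unique, since row names cannot repeat:

```r
# Illustrative data; in the question this would come from read.delim().
users <- data.frame(
  id   = c("53cd17c6-3c21-11e8-b2bf-7bab2a3c8486",
           "53cd187a-3c21-11e8-b2bf-dfe12d92673e"),
  name = c("alice", "bob"),
  stringsAsFactors = FALSE
)

stopifnot(anyDuplicated(users$id) == 0L)  # row names must be unique

rownames(users) <- users$id  # use the GUIDs as row names
users$id <- NULL             # drop the now-redundant column

users["53cd187a-3c21-11e8-b2bf-dfe12d92673e", "name"]
# [1] "bob"
```

Rows can then be looked up by GUID, as in the last line.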

