How to Create MD5 Hash of a Column in R

how to create md5 hash of a column in R?

The digest package is well suited to this task, so first we load it:

library(digest)

Then create (or load) a test data.frame df:

txt <-
"ID,VID
1,xyz-0001
2,abc-0987"

df <- read.table(header = TRUE, text = txt, sep = ",", stringsAsFactors = FALSE)
df

The initial data looks like:

  ID      VID
1  1 xyz-0001
2  2 abc-0987

Then we can use the digest function with the algorithm specified:

df$VID <- sapply(df$VID, digest, algo="md5")
df

Now the VID column in df is hashed:

  ID                              VID
1  1 44e3a9cf85f802ef50f18e64e01c5e32
2  2 c576ff180b2046c1a3ae939766588fd3
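
Note that digest serializes its input by default, so the values above are hashes of the serialized R object, not of the raw text. A minimal sketch, assuming you want hashes that match what other MD5 tools produce for the same strings (md5_strings is a hypothetical helper name, not part of digest):

```r
library(digest)

# Hypothetical helper (not part of digest): hash each element of a
# character vector as raw text instead of as a serialized R object.
md5_strings <- function(x) {
  vapply(x, digest, character(1),
         algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
}

df <- data.frame(ID = 1:2, VID = c("xyz-0001", "abc-0987"),
                 stringsAsFactors = FALSE)

# Keep the original column and store the hashes next to it
df$VID_md5 <- md5_strings(df$VID)
df
```

vapply is used here instead of sapply so that the result is guaranteed to be a plain character vector with one hash per element.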

MD5 file hash functions in R returning different values?

If you want to hash the contents of the file at that path, you need to tell each of the functions that. Try

digest("Downloads/pfd.RDS", file=TRUE, algo="md5")

and, using md5 from the openssl package,

md5(file("Downloads/pfd.RDS", open="rb"))

otherwise you are hashing the path name itself.

These return the same values in the simple case of

cat("hello", file="hello.txt")
digest("hello.txt", file=TRUE, algo="md5")
# [1] "5d41402abc4b2a76b9719d911017c592"
md5(file("hello.txt", open="rb"))
# md5 5d:41:40:2a:bc:4b:2a:76:b9:71:9d:91:10:17:c5:92
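
The same check can be sketched without openssl. The example below assumes a temporary file created only for the demonstration, and compares digest with file=TRUE against tools::md5sum from base R. It also shows that hashing the path string gives a different value:

```r
library(digest)

# Temporary file created only for this example
path <- tempfile(fileext = ".txt")
cat("hello", file = path)

# Hash of the file contents, computed two ways
from_digest <- digest(path, file = TRUE, algo = "md5")
from_tools  <- unname(tools::md5sum(path))

# Hash of the path string itself (what happens without file = TRUE)
from_path <- digest(path, algo = "md5", serialize = FALSE)

identical(from_digest, from_tools)  # the two content hashes agree
from_digest == from_path            # the path hash is a different value
```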

create hash value for each row of data in dataframe in R

If I understand correctly what you want, digest works directly with apply:

library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
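
Note that apply first converts the data frame to a matrix, so each row is hashed as a character vector of its formatted values. As an alternative sketch, the columns of each row can be pasted into one string and that string hashed as raw text. The data frame rows and the "|" separator below are assumptions for illustration only:

```r
library(digest)

# Hypothetical example data
rows <- data.frame(id = 1:3, name = c("a", "b", "a"),
                   stringsAsFactors = FALSE)

# Build one key string per row, e.g. "1|a".
# The separator must not occur in the data, or distinct rows could collide.
keys <- do.call(paste, c(unname(as.list(rows)), sep = "|"))

# One MD5 hash per row key
rows$hash <- vapply(keys, digest, character(1),
                    algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
rows
```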

recursive file list with md5 in R

One tip might be to use the md5 function from the openssl package instead of digest.

library(openssl)

md5s <- md5(file.names)

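md5 is already vectorised over character vectors, so you won't need sapply, which may improve your processing speed. Note, however, that given a character vector it hashes the strings themselves (here the file names, as in the path-name case above). To hash the file contents, each file has to be passed as a connection, e.g. md5(file(path, open="rb")).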

As for cbind, it binds columns by position, and md5s is in the same order as file.names, so the output keeps the order of file.names.
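
For hashing the contents of every file under a directory, one possible sketch uses base R only. It swaps in tools::md5sum (not the openssl function discussed above), which accepts a vector of file paths. The directory and files are created just for this example:

```r
# Temporary directory tree created only for this example
dir <- tempfile("md5-demo")
dir.create(file.path(dir, "sub"), recursive = TRUE)
cat("hello", file = file.path(dir, "a.txt"))
cat("hello", file = file.path(dir, "sub", "b.txt"))

# Recursive file list, then one content hash per file
file.names <- list.files(dir, recursive = TRUE, full.names = TRUE)
result <- data.frame(file = file.names,
                     md5 = unname(tools::md5sum(file.names)),
                     stringsAsFactors = FALSE)
result
```

Because md5sum returns the hashes in the order of its input, the rows of result follow the order of file.names.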

One-way hash function in R

If you have the digest package installed, you can do

digest::digest("This is my input")
# [1] "2e936bb276abca8a9e46bd32c7bdc01e"

(by default the result is returned as ASCII hex values). See the ?digest help page for a list of supported hashing algorithms.
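
As a small sketch of choosing another algorithm, the example below hashes the same input with MD5 and SHA-256. It assumes you want the hash of the raw string, hence serialize = FALSE:

```r
library(digest)

input <- "This is my input"

# serialize = FALSE hashes the raw string rather than the serialized object
md5_hex    <- digest(input, algo = "md5", serialize = FALSE)
sha256_hex <- digest(input, algo = "sha256", serialize = FALSE)

nchar(md5_hex)     # 32 hex characters (128 bits)
nchar(sha256_hex)  # 64 hex characters (256 bits)
```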

New dataframe column as function (digest) of another one is not working for me

Considering you have a very large dataset, it's better to test the different approaches on a somewhat larger dataset (for this example I use 100000 rows, bigger datasets take ages on my system):

df <- data.frame(name = replicate(1e5, paste(sample(LETTERS, 20, replace=TRUE), collapse="")), stringsAsFactors=FALSE)

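First, let's consider several of the available approaches:

library(digest)
library(data.table)
library(dplyr)
dt <- as.data.table(df)  # data.table copy of df, used by the data.table variants

# base R
df$md5 <- sapply(df$name, digest)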

# data.table (grouping by name, based on the assumption that all names are unique)
dt[, md5:=digest(name), name]

# data.table with a unique identifier for each row
dt[,indx:=.I][, md5:=digest(name), indx]

# dplyr (grouping by name, based on the assumption that all names are unique)
df %>% group_by(name) %>% mutate(md5=digest(name))

# dplyr with rowwise (from the other answer)
df %>% rowwise() %>% mutate(md5=digest(name))

Second, test which approach is the fastest:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          baseR = df$md5 <- sapply(df$name, digest),
          dtbl1 = dt[, md5:=digest(name), name],
          dtbl2 = dt[,indx:=.I][, md5:=digest(name), indx],
          dplyr = df %>% group_by(name) %>% mutate(md5=digest(name)),
          rowwi = df %>% rowwise() %>% mutate(md5=digest(name)))

which gives:

   test elapsed relative
2 dtbl1  77.878    1.000
3 dtbl2  78.343    1.006
1 baseR  81.399    1.045
5 rowwi 118.799    1.525
4 dplyr 129.748    1.666

So, sticking to a base R solution isn't such a bad choice at all. I suspect it is slow on your real dataset because of the digest function itself, not because some package or function is misbehaving.
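
If the real data contains repeated values (unlike the test data above, where all names are assumed unique), one possible way to reduce the number of digest calls is to hash each distinct value once and map the results back. A sketch with a hypothetical data frame dups:

```r
library(digest)

# Hypothetical data with many repeated values
set.seed(1)
dups <- data.frame(name = sample(c("alpha", "beta", "gamma"), 1000, replace = TRUE),
                   stringsAsFactors = FALSE)

# Hash each distinct value once ...
u <- unique(dups$name)
h <- vapply(u, digest, character(1), USE.NAMES = FALSE)

# ... then look up the hash for every row
dups$md5 <- h[match(dups$name, u)]

length(u)  # number of digest calls actually made
```

This only helps when values repeat. With all-unique values it makes as many digest calls as the plain sapply version.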

R hash a column does not work using digest

A hash function doesn't care how much input you give it, and digest is not vectorised: it treats whatever object it receives (a whole column, a list, etc.) as a single input and returns a single hash, rather than one hash per value. To verify, let's pass your whole column at once:

digest(c("1035656|8000|157.6|2018-12-10 00:00:00.0|2018-12-06 00:00:00.0", "1852231|460000|1748.0|2018-03-09 00:00:00.0|2018-03-07 00:00:00.0",
         "3197282|6000|55.2|2019-01-18 00:00:00.0|2019-01-16 00:00:00.0", "1827398|396000|21859.2|2019-02-25 00:00:00.0|2019-02-21 00:00:00.0",
         "1148967|60000|150.0|2018-10-15 00:00:00.0|2018-10-11 00:00:00.0"),
       algo = "md5", serialize = FALSE)

This gives the same output as in your example. Since there is only one return value, the whole column gets filled with that value:

[1] "d1ede7da2094651658adfd6171c33c52"

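The solution is fairly simple: apply the hash to each element of the column, e.g.:

# sapply returns a character vector (lapply would create a list column)
df$hash <- sapply(df$identifier, function(x) digest(x, algo = "md5", serialize = FALSE), USE.NAMES = FALSE)

This gives the intended output (note that row 1 has the same hash as the single value above):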

  identifier                                                         hash
1 1035656|8000|157.6|2018-12-10 00:00:00.0|2018-12-06 00:00:00.0     d1ede7da2094651658adfd6171c33c52
2 1852231|460000|1748.0|2018-03-09 00:00:00.0|2018-03-07 00:00:00.0  ca4caeac0a702094d51a13e67f23e56a
3 3197282|6000|55.2|2019-01-18 00:00:00.0|2019-01-16 00:00:00.0      239342dba0ec56f3b4200cb36046f2e0
4 1827398|396000|21859.2|2019-02-25 00:00:00.0|2019-02-21 00:00:00.0 54ea74e4344c14f8708dc47425ee1995
5 1148967|60000|150.0|2018-10-15 00:00:00.0|2018-10-11 00:00:00.0    f6bb25b0d7c1fbb65117d9403dadc7d2
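
The difference between one call on the whole vector and one call per element can be sketched on a tiny hypothetical vector x:

```r
library(digest)

x <- c("a", "b", "c")  # hypothetical example values

# One call on the whole vector: a single hash of the serialized object
whole <- digest(x, algo = "md5")

# One call per element: one hash per value, each of the raw string
each <- vapply(x, digest, character(1),
               algo = "md5", serialize = FALSE, USE.NAMES = FALSE)

length(whole)  # 1
length(each)   # 3
```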

