How to create md5 hash of a column in R?
The digest package is well suited to this task, so first we load it:
library(digest)
Then create or load a test data.frame df:
txt <-
"ID,VID
1,xyz-0001
2,abc-0987"
df <- read.table(header=TRUE, text=txt, sep=",", stringsAsFactors=FALSE)
df
The initial data looks like:
ID VID
1 1 xyz-0001
2 2 abc-0987
Then we can apply the digest function to each value, with the algorithm specified:
df$VID <- sapply(df$VID, digest, algo="md5")
df
Now the VID column in df is hashed:
ID VID
1 1 44e3a9cf85f802ef50f18e64e01c5e32
2 2 c576ff180b2046c1a3ae939766588fd3
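One caveat, offered as a side note rather than as part of the answer above: by default digest() serializes the R object before hashing, so the values shown will generally not match what other tools report for the same text. If the hashes need to be comparable outside R, a minimal sketch along these lines, using serialize = FALSE, may be closer to what is wanted (the column name VID_md5 is made up for the example):

```r
library(digest)

df <- data.frame(ID = 1:2, VID = c("xyz-0001", "abc-0987"),
                 stringsAsFactors = FALSE)

# Hash the raw characters of each value rather than the serialized R object.
# vapply() guarantees a character vector with exactly one hash per row.
df$VID_md5 <- vapply(df$VID, digest, character(1),
                     algo = "md5", serialize = FALSE, USE.NAMES = FALSE)

# Sanity check against the well-known MD5 value of the string "hello"
hello_md5 <- digest("hello", algo = "md5", serialize = FALSE)
```

With serialize = FALSE, hello_md5 should be the familiar "5d41402abc4b2a76b9719d911017c592".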
MD5 file hash functions in R returning different values?
If you want to hash the contents of the file at that path, you need to tell each of the functions that. Try
digest("Downloads/pfd.RDS", file=TRUE, algo="md5")
and
md5(file("Downloads/pfd.RDS", open="rb"))
otherwise you are hashing the path name itself.
These return the same values in the simple case of
cat("hello", file="hello.txt")
digest("hello.txt", file=TRUE, algo="md5")
# [1] "5d41402abc4b2a76b9719d911017c592"
md5(file("hello.txt", open="rb"))
# md5 5d:41:40:2a:bc:4b:2a:76:b9:71:9d:91:10:17:c5:92
create hash value for each row of data in dataframe in R
If I get what you want properly, digest will work directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
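Note that apply() converts the data frame to a character matrix before digest sees each row, so the hash depends on how that conversion formats the values. If you would rather control exactly what gets hashed, one alternative is to paste the columns together yourself. This is only a sketch: the data frame dat and the "|" separator are assumptions made for the example, not part of the question.

```r
library(digest)

# Hypothetical stand-in for the real data
dat <- data.frame(id = c(1, 2, 3), code = c("a", "b", "a"),
                  stringsAsFactors = FALSE)

# Build one string per row; the separator keeps ("ab", "c") distinct from ("a", "bc")
row_strings <- do.call(paste, c(dat, sep = "|"))

# Hash each row string; serialize = FALSE hashes the characters themselves
dat_hash <- data.frame(
  uid  = seq_len(nrow(dat)),
  hash = vapply(row_strings, digest, character(1),
                algo = "md5", serialize = FALSE, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

The separator should be a character that cannot occur in the data; otherwise two different rows could produce the same string.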
recursive file list with md5 in R
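One tip might be to use the md5 function from openssl instead of digest.
library(openssl)
md5s <- md5(file.names)
It is already vectorised over character vectors, so you won't need sapply, which may improve your processing speed. Be aware, however, that called this way it hashes the path strings themselves, not the contents of the files. To hash the contents, pass a connection such as file(path, open="rb") for each file, or use tools::md5sum(file.names), which is vectorised over file paths.
As for cbind, it binds columns by position, so as long as md5s is computed from file.names the output keeps the order of file.names.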
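If the goal is a recursive listing with a hash of each file's contents, base R's tools::md5sum() is one option that needs no extra package. A minimal sketch, in which the throwaway directory and the file names are made up so the example is self-contained:

```r
# Create a small temporary directory tree (hypothetical files for the demo)
root <- file.path(tempdir(), "md5_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
cat("hello", file = file.path(root, "a.txt"))
cat("world", file = file.path(root, "sub", "b.txt"))

# List files recursively, then hash the contents of each one
file.names <- list.files(root, recursive = TRUE, full.names = TRUE)
md5s <- tools::md5sum(file.names)   # named character vector, one hash per file

# Both vectors are in the same order, so they can be combined by position
result <- data.frame(file = file.names, md5 = unname(md5s),
                     stringsAsFactors = FALSE)
```

Because md5s is computed directly from file.names, the rows of result follow the order of file.names.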
One-way hash function in R
If you have the digest package installed, you can do
digest::digest("This is my input")
# [1] "2e936bb276abca8a9e46bd32c7bdc01e"
(by default the result is an MD5 hash returned as ASCII hex values). See the ?digest help page for a list of supported hashing algorithms.
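Since MD5 is no longer considered collision-resistant, a different algorithm may be preferable where the one-way property matters. The sketch below shows how the algo argument selects the hash function; the input string is arbitrary, and serialize = FALSE is used so that the characters themselves are hashed rather than the serialized R object:

```r
library(digest)

input <- "hello"

# algo selects the hash function; both calls hash the raw characters of input
md5_hex    <- digest(input, algo = "md5",    serialize = FALSE)
sha256_hex <- digest(input, algo = "sha256", serialize = FALSE)

# MD5 yields 32 hex characters, SHA-256 yields 64
hash_lengths <- c(md5 = nchar(md5_hex), sha256 = nchar(sha256_hex))
```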
New dataframe column as function (digest) of another one is not working for me
Considering you have a very large dataset, it's better to test the different approaches on a somewhat larger dataset (for this example I use 100000 rows, bigger datasets take ages on my system):
df <- data.frame(name = replicate(1e5, paste(sample(LETTERS, 20, replace=TRUE), collapse="")), stringsAsFactors=FALSE)
First, let's consider several approaches available:
library(digest)
library(data.table)
library(dplyr)
dt <- as.data.table(df)
# base R
df$md5 <- sapply(df$name, digest)
# data.table (grouping by name, based on the assumption that all names are unique)
dt[, md5:=digest(name), name]
# data.table with a unique identifier for each row
dt[,indx:=.I][, md5:=digest(name), indx]
# dplyr (grouping by name, based on the assumption that all names are unique)
df %>% group_by(name) %>% mutate(md5=digest(name))
# dplyr with rowwise (from the other answer)
df %>% rowwise() %>% mutate(md5=digest(name))
Second, test which approach is the fastest:
library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
baseR = df$md5 <- sapply(df$name, digest),
dtbl1 = dt[, md5:=digest(name), name],
dtbl2 = dt[,indx:=.I][, md5:=digest(name), indx],
dplyr = df %>% group_by(name) %>% mutate(md5=digest(name)),
rowwi = df %>% rowwise() %>% mutate(md5=digest(name)))
which gives:
test elapsed relative
2 dtbl1 77.878 1.000
3 dtbl2 78.343 1.006
1 baseR 81.399 1.045
5 rowwi 118.799 1.525
4 dplyr 129.748 1.666
So, sticking to a base R solution isn't such a bad choice at all. I suspect the reason it is slow on your real dataset is the digest function itself rather than some misbehavior of a particular package or function.
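If digest itself is the bottleneck, one thing that may help when the column contains repeated values is to hash each distinct value only once and map the results back. This is a sketch under that assumption (in the benchmark data above every name is unique, so it would gain nothing there); the small data frame is made up for the example:

```r
library(digest)

set.seed(1)
# Hypothetical data with many repeated names
df <- data.frame(name = sample(c("alpha", "beta", "gamma"), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)

# Hash each distinct value once ...
uniq <- unique(df$name)
uniq_md5 <- vapply(uniq, digest, character(1), USE.NAMES = FALSE)

# ... then look up the hash for every row by position in uniq
df$md5 <- uniq_md5[match(df$name, uniq)]
```

Here digest is called three times instead of 1000, and the resulting column is the same as the one produced by sapply(df$name, digest).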
R hash a column does not work using digest
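The digest function is not vectorised: whatever you pass it, it returns a single hash. When you hand it the whole column, you therefore get one value back rather than one per row. (With serialize=F it appears to use only the first element of a character vector, which is why the value below equals the hash of the first row; with the default serialize=TRUE it would hash the serialized vector as a whole.) For verification, let's input your whole column at once: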
digest( c("1035656|8000|157.6|2018-12-10 00:00:00.0|2018-12-06 00:00:00.0", "1852231|460000|1748.0|2018-03-09 00:00:00.0|2018-03-07 00:00:00.0",
"3197282|6000|55.2|2019-01-18 00:00:00.0|2019-01-16 00:00:00.0", "1827398|396000|21859.2|2019-02-25 00:00:00.0|2019-02-21 00:00:00.0",
"1148967|60000|150.0|2018-10-15 00:00:00.0|2018-10-11 00:00:00.0"), algo="md5", serialize= F)
This gives the same output as in your example:
"d1ede7da2094651658adfd6171c33c52"
Since there is only one return value, the whole column gets filled with it.
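The solution is fairly simple: apply the hash to each element of the column separately (sapply returns a plain character vector here, whereas lapply would create a list column):
df$hash <- sapply(df$identifier, function(x) digest(x, algo="md5", serialize=FALSE), USE.NAMES=FALSE)
This gives the intended output: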
identifier hash
1 1035656|8000|157.6|2018-12-10 00:00:00.0|2018-12-06 00:00:00.0 d1ede7da2094651658adfd6171c33c52
2 1852231|460000|1748.0|2018-03-09 00:00:00.0|2018-03-07 00:00:00.0 ca4caeac0a702094d51a13e67f23e56a
3 3197282|6000|55.2|2019-01-18 00:00:00.0|2019-01-16 00:00:00.0 239342dba0ec56f3b4200cb36046f2e0
4 1827398|396000|21859.2|2019-02-25 00:00:00.0|2019-02-21 00:00:00.0 54ea74e4344c14f8708dc47425ee1995
5 1148967|60000|150.0|2018-10-15 00:00:00.0|2018-10-11 00:00:00.0 f6bb25b0d7c1fbb65117d9403dadc7d2