Removing "Nul" Characters (Within R)

Read the file as binary, then substitute the NUL bytes, e.g. to replace them with spaces:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...

You could also substitute the NULs with a rarely used character (such as "\01") and then work on the string in place. For example, to replace each pair of NULs ("\00\00") with a single space:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__

Find/replace NUL character in R

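In an R string literal, "\01" is parsed as an octal escape (the control character \001), not as a backslash followed by "01". One way to recover the intended path is to use gsub() within eval(parse(text = ...)):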

dat <- data.frame(path = c("X:\01_aim\01_seq.R", "X:\01_aim\02_seq.R", "X:\01_aim\03_seq.R", "X:\01_aim\04_seq.R"), 
dat = c("data1.csv", "data2.csv", "data1.csv", "data2.csv"))

temp <- eval(parse(text= gsub("\\", "/", deparse(dat$path), fixed=TRUE)))
gsub("X:", "", temp)

#> [1] "/001_aim/001_seq.R" "/001_aim/002_seq.R" "/001_aim/003_seq.R"
#> [4] "/001_aim/004_seq.R"

Created on 2021-08-23 by the reprex package (v2.0.1)

Another way is to escape the strings with stringi::stri_escape_unicode. Because the control character is written out in Unicode-escaped form (\u0001), each occurrence starts with an unwanted \u0. We can then use gsub("\\\\u0", "/", ...) to get the desired file path.

dat <- data.frame(path = c("X:\01_aim\01_seq.R", "X:\01_aim\02_seq.R", "X:\01_aim\03_seq.R"), 
dat = c("data1.csv", "data2.csv", "data1.csv"))

temp <- gsub("X:", "", stringi::stri_escape_unicode(dat$path))
gsub("\\\\u0", "/", temp)
#> [1] "/001_aim/001_seq.R" "/001_aim/002_seq.R" "/001_aim/003_seq.R"


Removing NUL characters

This might help; I used to fix my files like this:
http://security102.blogspot.ru/2010/04/findreplace-of-nul-objects-in-notepad.html

Basically, you need to find the \x00 characters with a regular-expression search and replace them.


R: removing NULL elements from a list

The closest you'll be able to get is to first name the list elements and then remove the NULLs.

names(x) <- seq_along(x)

## Using some higher-order convenience functions
Filter(Negate(is.null), x)
# $`11`
# [1] 123
#
# $`13`
# [1] 456

# Or, using a slightly more standard R idiom
x[sapply(x, is.null)] <- NULL
x
# $`11`
# [1] 123
#
# $`13`
# [1] 456

remove null character from string

You can remove \x00 runes from a string the same way you can remove any other runes:

valueStr = strings.Replace(valueStr, "\x00", "", -1)

Example:

s := "a\x00b"
fmt.Printf("%q\n", s)
s = strings.Replace(s, "\x00", "", -1)
fmt.Printf("%q\n", s)

Output (try it on the Go Playground):

"a\x00b"
"ab"
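When the data comes straight from a file, it is often still a byte slice rather than a string. As a minimal sketch of the same idea on bytes (the helper name stripNUL is hypothetical and not part of the original answer), bytes.ReplaceAll can drop every NUL byte:

```go
package main

import (
	"bytes"
	"fmt"
)

// stripNUL returns a copy of data with every NUL (0x00) byte removed.
// The name is an assumption made for this illustration.
func stripNUL(data []byte) []byte {
	return bytes.ReplaceAll(data, []byte{0}, nil)
}

func main() {
	raw := []byte("a\x00b\x00\x00c")
	fmt.Printf("%q\n", stripNUL(raw)) // prints "abc"
}
```

This avoids converting to a string and back when the cleaned bytes are written out again.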

Using strings.Replacer

If you need several replacements, you can perform them in a single operation with strings.Replacer. This is also more efficient, because it iterates over the input only once and allocates only one string for the result, no matter how many substrings you want to replace.

For example:

s := " \t\n\rabc\x00"
fmt.Printf("%q\n", s)

r := strings.NewReplacer(" ", "", "\t", "", "\n", "", "\r", "", "\x00", "")
s = r.Replace(s)
fmt.Printf("%q\n", s)

Output (try it on the Go Playground):

" \t\n\rabc\x00"
"abc"

It's enough to create a strings.Replacer once: you can store it in a (global) variable and reuse it, and it is even safe to use concurrently from multiple goroutines.
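A minimal sketch of that reuse pattern, assuming one shared package-level replacer and a hypothetical helper cleanAll that cleans each input in its own goroutine:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// cleaner is created once and shared by all goroutines.
var cleaner = strings.NewReplacer(" ", "", "\t", "", "\n", "", "\r", "", "\x00", "")

// cleanAll is a hypothetical helper (not from the original answer) that
// cleans every input concurrently using the shared replacer.
func cleanAll(inputs []string) []string {
	out := make([]string, len(inputs))
	var wg sync.WaitGroup
	for i, s := range inputs {
		wg.Add(1)
		go func(i int, s string) {
			defer wg.Done()
			out[i] = cleaner.Replace(s) // each goroutine writes to its own index
		}(i, s)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Printf("%q\n", cleanAll([]string{" a\x00", "\tb\n", "c\r\x00"}))
}
```

Each goroutine writes to a distinct slice element, so only the replacer itself is shared.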

Using strings.Map()

If you only want to replace (or remove) single runes and not multi-rune (or multi-byte) substrings, you can also use strings.Map(), which might be even more efficient than strings.Replacer.

First define a function that tells which runes to replace (or remove if you return a negative value):

func remove(r rune) rune {
	switch r {
	case ' ', '\t', '\n', '\r', 0:
		return -1
	}
	return r
}

And then using it:

s := " \t\n\rabc\x00"
fmt.Printf("%q\n", s)

s = strings.Map(remove, s)
fmt.Printf("%q\n", s)

Output (try it on the Go Playground):

" \t\n\rabc\x00"
"abc"
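If only NUL needs to go, the mapping function can be narrower. The following is a sketch under that assumption (dropNUL and removeNUL are hypothetical names); it also shows that multi-byte runes pass through unchanged:

```go
package main

import (
	"fmt"
	"strings"
)

// dropNUL is a hypothetical mapping function: it returns -1 (drop the rune)
// for NUL and returns every other rune unchanged.
func dropNUL(r rune) rune {
	if r == 0 {
		return -1
	}
	return r
}

// removeNUL applies dropNUL to every rune of s.
func removeNUL(s string) string {
	return strings.Map(dropNUL, s)
}

func main() {
	s := "h\x00é\x00llo 世界\x00"
	fmt.Printf("%q\n", removeNUL(s)) // prints "héllo 世界"
}
```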

Benchmarks

We might think strings.Map() will be superior, as it only has to deal with runes, which are just int32 numbers, while strings.Replacer has to deal with string values, which are headers (length + data pointer) plus a series of bytes.

But string values are stored in memory as UTF-8 byte sequences, which means strings.Map() has to decode the runes from the UTF-8 byte sequence (and encode them back to UTF-8 at the end), while strings.Replacer does not: it may simply look for byte-sequence matches without decoding the runes. And strings.Replacer is highly optimized to take advantage of such "tricks".
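To make the byte-versus-rune distinction concrete, here is a small sketch (the two counting functions are hypothetical illustrations, not how strings.Replacer is implemented). Indexing a string yields raw bytes, while ranging over it decodes UTF-8 rune by rune. Because the byte 0x00 never occurs inside a multi-byte UTF-8 sequence, both approaches find the same NULs:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countNULBytes scans the raw bytes of s without any UTF-8 decoding.
func countNULBytes(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == 0 {
			n++
		}
	}
	return n
}

// countNULRunes ranges over s, which decodes one rune per iteration.
func countNULRunes(s string) int {
	n := 0
	for _, r := range s {
		if r == 0 {
			n++
		}
	}
	return n
}

func main() {
	s := "é\x00世\x00"                                  // é is 2 bytes, 世 is 3 bytes
	fmt.Println(len(s), utf8.RuneCountInString(s))    // prints 7 4
	fmt.Println(countNULBytes(s), countNULRunes(s))   // prints 2 2
}
```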

So let's create a benchmark to compare them, using the following replacer and mapping function:

var r = strings.NewReplacer(" ", "", "\t", "", "\n", "", "\r", "", "\x00", "")

func remove(r rune) rune {
	switch r {
	case ' ', '\t', '\n', '\r', 0:
		return -1
	}
	return r
}

And we run benchmarks on different input strings:

func BenchmarkReplaces(b *testing.B) {
	cases := []struct {
		title string
		input string
	}{
		{
			title: "None",
			input: "abc",
		},
		{
			title: "Normal",
			input: " \t\n\rabc\x00",
		},
		{
			title: "Long",
			input: "adsfWR \t\rab\nc\x00 \t\n\rabc\x00asdfWER\n\r",
		},
	}

	for _, c := range cases {
		b.Run("Replacer-"+c.title, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				r.Replace(c.input)
			}
		})
		b.Run("Map-"+c.title, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				strings.Map(remove, c.input)
			}
		})
	}
}

And now let's see the benchmark results:

BenchmarkReplaces/Replacer-None-4      100000000    12.3 ns/op     0 B/op    0 allocs/op
BenchmarkReplaces/Map-None-4           100000000    16.1 ns/op     0 B/op    0 allocs/op
BenchmarkReplaces/Replacer-Normal-4     20000000    92.7 ns/op     6 B/op    2 allocs/op
BenchmarkReplaces/Map-Normal-4          20000000    92.4 ns/op    16 B/op    2 allocs/op
BenchmarkReplaces/Replacer-Long-4        5000000     234 ns/op    64 B/op    2 allocs/op
BenchmarkReplaces/Map-Long-4             5000000     235 ns/op    80 B/op    2 allocs/op

Despite expectations, strings.Replacer performs well, on par with strings.Map(), because it does not have to decode and encode runes.

Is there a way to replace a character in a vector with a NULL value in R?

It's worth remembering the difference between NULL and NA. NA marks a missing value but still occupies a position in the vector; NULL is no value whatsoever, so it cannot be stored as an element of an atomic vector. To get the second output to match the first, you have to drop the element instead, as in the following:

column <- c("None", "Some", "NULL", "Many", "All")
column <- column[column != "NULL"]

This creates a shorter vector, which is why str_replace doesn't like it: it returns a vector of the same length as its input, so it cannot drop elements.

Replace all string instances of NULL with actual NULL or NA in a data frame

Just do this:

exampledf[exampledf=="NULL"] <- NA

or with the pipe (%>%, available once dplyr is loaded):

exampledf <- exampledf %>% replace(exampledf == "NULL", NA)

