Chopping a String into a Vector of Fixed Width Character Elements

Using substring is the most direct approach. Deriving the end positions from the start positions keeps a trailing odd character:

starts <- seq(1, nchar(x), 2)
substring(x, starts, starts + 1)

But here's a solution with plyr:

library("plyr")
laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))

Splitting a string into fixed-size chunks

If you want speed, Rcpp is a good choice:

library(Rcpp);
cppFunction('
    List strsplitN(std::vector<std::string> v, int N ) {
        if (N < 1) throw std::invalid_argument("N must be >= 1.");
        List res(v.size());
        for (int i = 0; i < v.size(); ++i) {
            int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
            std::vector<std::string> resCur(num,std::string(N,0));
            for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N,N));
            res[i] = resCur;
        }
        return res;
    }
');

ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
## user system elapsed
## 0.109 0.015 0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000

Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.


More demos:

strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.

There are two important caveats with the above implementation:

1: It doesn't handle NAs correctly. Rcpp seems to stringify NA to 'NA' when it's forced to come up with a std::string. You can easily solve this on the R side with a wrapper that replaces the offending list components with a true NA.

x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##

2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.

strsplitN('aΩ',1L);
## [[1]]
## [1] "a" "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
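
The Unicode-aware traversal mentioned in the second caveat could look roughly like the following. This is a minimal sketch in plain C++ (no Rcpp glue), not the original author's code: the name split_utf8 is hypothetical, the input is assumed to be valid UTF-8, and chunks are counted in code points, so combining marks are not grouped with their base character.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into chunks of n code points.
// Assumes valid UTF-8 input; no validation is performed.
std::vector<std::string> split_utf8(const std::string& s, std::size_t n) {
    if (n < 1) throw std::invalid_argument("n must be >= 1.");
    std::vector<std::string> out;
    std::size_t start = 0;  // byte offset where the current chunk begins
    std::size_t count = 0;  // code points seen so far in the current chunk
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Bytes of the form 10xxxxxx continue the previous code point.
        bool is_lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (is_lead) {
            if (count == n) {  // chunk is full: emit it and start a new one
                out.push_back(s.substr(start, i - start));
                start = i;
                count = 0;
            }
            ++count;
        }
    }
    if (start < s.size()) out.push_back(s.substr(start));  // trailing chunk
    return out;
}
```

Because the number of chunks is not known until the string has been walked, the result vector grows with push_back instead of being preallocated, which is the performance cost the caveat refers to.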

Fastest way to split strings into fixed-length elements in R

We can split by specifying a regex lookbehind that matches the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).

splitInParts <- function(string, size){
  pat <- paste0('(?<=.{',size,'})')
  strsplit(string, pat, perl=TRUE)
}

splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"

splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"

splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"

Another approach is to use stri_extract_all_regex from library(stringi).

library(stringi)
splitInParts2 <- function(string, size){
  pat <- paste0('.{1,', size, '}')
  stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"

stri_extract_all_regex(str1, '.{1,3}')

data

 str1 <- "azertyuiop"

How to split a string into substrings of a given length?

Here is one way

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

or more generally (for a string with an even number of characters)

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"

Edit: This is much, much faster

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

It first splits the string into characters. Then, it pastes each odd-positioned character together with the even-positioned character that follows it.

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
  substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
  sst <- strsplit(text, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0

R: How do I split a string into a vector so that each place in the vector corresponds to a letter

We need split = "":

unlist(strsplit('abcde', ''))

How to bin a long continuous string into sections of n characters?

This is a little bit cheesy, but:

c(                                   ## drop "omit" attribute
  na.omit(                           ## drop NA values (from end)
    unlist(                          ## collapse from data frame to vector
      read.fwf(                      ## read fixed-width "file"
        textConnection(x),           ## treat string as a file
        widths = rep(10,             ## string width
                     1000            ## a 'big enough' value
        )))))

Or, if you like, with the native pipe |> (R 4.1.0 and later):

(x
|> textConnection()
|> read.fwf(widths = rep(10, 1000))
|> unlist()
|> na.omit()
|> c()
)

R: How to split string into pieces

You can try with str_extract_all:

stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
#[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG_HAS"

With base R:

regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]

Here we extract all runs of upper-case letters, lower-case letters and underscores. Everything else is ignored, so characters outside that set, such as \\00, do not appear in the final output.

Right way to split an std::string into a vector<string>

For space-separated strings, you can do this:

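// Needs <algorithm>, <iostream>, <iterator>, <sstream>, <string> and <vector>; place the statements inside a function such as main().
std::string s = "What is the right way to split a string into a vector of strings";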
std::stringstream ss(s);
std::istream_iterator<std::string> begin(ss);
std::istream_iterator<std::string> end;
std::vector<std::string> vstrings(begin, end);
std::copy(vstrings.begin(), vstrings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));

Output:

What
is
the
right
way
to
split
a
string
into
a
vector
of
strings
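
The istream_iterator technique above can be wrapped in a small reusable function. The following is a minimal sketch; the name split_ws is hypothetical and does not come from the original answer.

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper: split a string on runs of whitespace.
std::vector<std::string> split_ws(const std::string& s) {
    std::istringstream ss(s);
    std::istream_iterator<std::string> begin(ss);  // reads one token per step
    std::istream_iterator<std::string> end;        // end-of-stream sentinel
    return std::vector<std::string>(begin, end);
}
```

Since operator>> skips leading whitespace, consecutive spaces, tabs and newlines never produce empty tokens, and an empty input yields an empty vector.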


For strings that have both commas and spaces:

// Needs <cstring> (std::memcpy) and <locale> (std::ctype, std::locale).
struct tokens: std::ctype<char>
{
    tokens(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        typedef std::ctype<char> cctype;
        static const cctype::mask *const_rc= cctype::classic_table();

        static cctype::mask rc[cctype::table_size];
        std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));

        // Classify comma and space as whitespace so the stream splits on both.
        rc[','] = std::ctype_base::space;
        rc[' '] = std::ctype_base::space;
        return &rc[0];
    }
};

std::string s = "right way, wrong way, correct way";
std::stringstream ss(s);
ss.imbue(std::locale(std::locale(), new tokens()));
std::istream_iterator<std::string> begin(ss);
std::istream_iterator<std::string> end;
std::vector<std::string> vstrings(begin, end);
std::copy(vstrings.begin(), vstrings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));

Output:

right
way
wrong
way
correct
way
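
If installing a custom ctype facet feels heavy, the same tokens can be produced by scanning for delimiter characters directly. This is an alternative sketch, not part of the original answer; the name split_any and its default delimiter set are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split on any character found in delims,
// treating runs of delimiters as a single separator.
std::vector<std::string> split_any(const std::string& s,
                                   const std::string& delims = ", ") {
    std::vector<std::string> out;
    std::size_t start = s.find_first_not_of(delims);  // skip leading delimiters
    while (start != std::string::npos) {
        std::size_t end = s.find_first_of(delims, start);
        // If end is npos, substr clamps the length to the rest of the string.
        out.push_back(s.substr(start, end - start));
        start = s.find_first_not_of(delims, end);
    }
    return out;
}
```

Unlike the facet-based version, which still treats tabs and newlines as whitespace because it copies the classic table, this sketch splits only on the characters passed in delims.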

Convert a Character String with commas into a vector in R

We can use scan

newvec <- scan(text = text, what ="", sep=",", strip.white = TRUE, quiet = TRUE)
newvec
#[1] "MY" "NAME" "IS" "SLIM" "SHADY"

Or with strsplit

unlist(strsplit(text, ",\\s*"))

Collapse character vector into single string with each string on its own row

There is no way to create an object that stores the data the way you want other than keeping the \n characters and having them interpreted when the string is printed. Whitespace in strings is meaningful, i.e. spaces are preserved when a string is printed. However, how that whitespace appears in your console will vary for a number of reasons. As it stands, your string already contains the characters that tell any function processing it to print each substring on its own line.

text <- c("This should be first row",
          "This should be second row",
          "This should be third row")

# This is what you did
# Notice that cat() prints to the console and returns NULL, so nothing useful is assigned!
test <- cat(paste(text, collapse = "\n"))
#> This should be first row
#> This should be second row
#> This should be third row

print(test)
#> NULL

# This is what you want: assigning the string itself
test2 <- paste(text, collapse = "\n")

print(test2)
#> [1] "This should be first row\nThis should be second row\nThis should be third row"

