Chopping a string into a vector of fixed width character elements
Using substring is the best approach:
# end positions run to nchar(x) + 1 so an odd-length x keeps its last character
substring(x, seq(1, nchar(x), 2), seq(2, nchar(x) + 1, 2))
But here's a solution with plyr:
library("plyr")
laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))
Splitting a string into fixed-size chunks
If you want speed, Rcpp is always a good choice:
library(Rcpp);
cppFunction('
    List strsplitN(std::vector<std::string> v, int N) {
        if (N < 1) throw std::invalid_argument("N must be >= 1.");
        List res(v.size());
        for (int i = 0; i < v.size(); ++i) {
            // number of chunks: ceiling(length / N)
            int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
            // preallocate the chunk vector, then fill it in place
            std::vector<std::string> resCur(num, std::string(N, 0));
            for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N, N));
            res[i] = resCur;
        }
        return res;
    }
');
ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
## user system elapsed
## 0.109 0.015 0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000
Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.
More demos:
strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.
There are two important caveats with the above implementation:
1: It doesn't handle NAs correctly. Rcpp seems to stringify to 'NA' when it's forced to come up with a std::string. You can easily solve this in Rland with a wrapper that replaces the offending list components with a true NA.
x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##
2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.
strsplitN('aΩ',1L);
## [[1]]
## [1] "a" "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
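As a rough illustration of what a Unicode-aware traversal could look like, here is a minimal sketch in plain C++ (no Rcpp glue). The helper name splitUtf8 is hypothetical, and the code assumes the input is valid UTF-8. It counts code points by skipping continuation bytes (those of the form 10xxxxxx), so it keeps the two bytes of Ω together, but it makes no attempt to handle grapheme clusters such as a letter followed by a combining accent.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer above): split a UTF-8 encoded
// string into chunks of n code points rather than n bytes.
// Assumes `s` is valid UTF-8; malformed input is not detected.
std::vector<std::string> splitUtf8(const std::string& s, std::size_t n) {
    if (n < 1) throw std::invalid_argument("n must be >= 1.");
    std::vector<std::string> out;
    std::size_t start = 0;  // byte offset where the current chunk begins
    std::size_t count = 0;  // code points seen so far in the current chunk
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        const bool startsCodePoint =
            (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (!startsCodePoint) continue;
        if (count == n) {  // chunk is full: close it before this code point
            out.push_back(s.substr(start, i - start));
            start = i;
            count = 0;
        }
        ++count;
    }
    if (start < s.size()) out.push_back(s.substr(start));  // trailing chunk
    return out;
}
```

Under those assumptions, calling splitUtf8 on a UTF-8 encoded "aΩ" with n = 1 gives the two pieces "a" and "Ω" rather than three single-byte pieces. Note that the chunks are appended one at a time here instead of being preallocated, which is the trade-off the caveat above describes.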
Fastest way to split strings into fixed-length elements in R
We can split by specifying a regex lookbehind that matches the position preceded by n characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
pat <- paste0('(?<=.{',size,'})')
strsplit(string, pat, perl=TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Another approach is to use stri_extract_all_regex from the stringi package.
library(stringi)
splitInParts2 <- function(string, size){
pat <- paste0('.{1,', size, '}')
stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
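The greedy '.{1,size}' pattern is not specific to stringi. As a cross-check of the idea, here is a minimal sketch of the same extraction in C++ with std::regex. The function name splitInPartsCpp is hypothetical, and because std::regex on std::string matches bytes, this version is only appropriate for ASCII input.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper mirroring splitInParts2: collect successive greedy
// matches of ".{1,size}". Operates on bytes, so it is only safe for ASCII.
// A size below 1 produces an invalid pattern and std::regex will throw.
std::vector<std::string> splitInPartsCpp(const std::string& s, int size) {
    const std::regex pat(".{1," + std::to_string(size) + "}");
    std::vector<std::string> parts;
    // A default-constructed sregex_iterator marks the end of the matches.
    for (std::sregex_iterator it(s.begin(), s.end(), pat), end; it != end; ++it) {
        parts.push_back(it->str());
    }
    return parts;
}
```

With the sample data above, splitInPartsCpp("azertyuiop", 3) yields "aze", "rty", "uio", "p", matching the R output.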
How to split a string into substrings of a given length?
Here is one way
substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"
or more generally
text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
Edit: This is much, much faster
sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
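It first splits the string into single characters. Then it pastes each odd-positioned character together with the even-positioned character that follows it. This assumes the string has an even number of characters; with an odd count, paste0 recycles the shorter vector and the last piece comes out wrong.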
Timings
text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
sst <- strsplit(text, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#   test replications elapsed relative user.self sys.self user.child sys.child
# 1   g1          100  95.451 79.87531    95.438        0          0         0
# 2   g2          100   1.195  1.00000     1.196        0          0         0
R: How do I split a string into a vector so that each place in the vector corresponds to a letter
We need split = "" so that strsplit breaks the string at every character:
unlist(strsplit('abcde', ''))
How to bin a long continuous string into section of nth characters?
This is a little bit cheesy, but:
c( ## drop "omit" attribute
na.omit( ## drop NA values (from end)
unlist( ## collapse from data frame to vector
read.fwf( ## read fixed-width "file"
textConnection(x), ## treat string as a file
widths = rep(10, ## string width
1000 ## a 'big enough' value
)))))
Or, if you like (in recent-ish versions of R that have |>, i.e. 4.1.0 and later):
(x
|> textConnection()
|> read.fwf(widths = rep(10, 1000))
|> unlist()
|> na.omit()
|> c()
)
R: How to split string into pieces
You can try with str_extract_all:
stringr::str_extract_all(x, '[A-Za-z_]+')[[1]]
#[1] "CN" "Shandong" "Zibo" "ABCDEFGHIJK" "IMG_HAS"
With base R:
regmatches(x, gregexpr('[A-Za-z_]+', x))[[1]]
Here we extract every run of upper-case letters, lower-case letters, and underscores. Everything else is ignored, so characters like �\\00? are not there in the final output.
Right way to split an std::string into a vector<string>
For space-separated strings, you can do this:
// requires <algorithm>, <iostream>, <iterator>, <sstream>, <string>, <vector>
std::string s = "What is the right way to split a string into a vector of strings";
std::stringstream ss(s);
std::istream_iterator<std::string> begin(ss);
std::istream_iterator<std::string> end;
std::vector<std::string> vstrings(begin, end);
std::copy(vstrings.begin(), vstrings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
Output:
What
is
the
right
way
to
split
a
string
into
a
vector
of
strings
For strings that have both commas and spaces:
// requires <algorithm>, <cstring>, <iostream>, <iterator>, <locale>, <sstream>, <string>, <vector>
struct tokens: std::ctype<char>
{
    tokens(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table()
    {
        typedef std::ctype<char> cctype;
        static const cctype::mask *const_rc= cctype::classic_table();
        static cctype::mask rc[cctype::table_size];
        std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));
        rc[','] = std::ctype_base::space;  // treat comma as whitespace
        rc[' '] = std::ctype_base::space;  // space is already whitespace; kept for clarity
        return &rc[0];
    }
};
std::string s = "right way, wrong way, correct way";
std::stringstream ss(s);
ss.imbue(std::locale(std::locale(), new tokens()));
std::istream_iterator<std::string> begin(ss);
std::istream_iterator<std::string> end;
std::vector<std::string> vstrings(begin, end);
std::copy(vstrings.begin(), vstrings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
Output:
right
way
wrong
way
correct
way
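If imbuing a custom locale feels heavy, a different technique (not the facet approach above) is to scan the string directly with find_first_of and find_first_not_of. The following is a minimal sketch under that assumption; the helper name splitOnAny is hypothetical. It treats every character in delims as a separator and drops empty tokens, so a ", " between words does not produce an empty string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `s` on any character found in `delims`,
// skipping empty tokens between consecutive delimiters.
std::vector<std::string> splitOnAny(const std::string& s, const std::string& delims) {
    std::vector<std::string> tokens;
    std::size_t begin = s.find_first_not_of(delims);  // start of the first token
    while (begin != std::string::npos) {
        // Position of the next delimiter, or npos if the token runs to the end.
        const std::size_t end = s.find_first_of(delims, begin);
        tokens.push_back(s.substr(begin, end - begin));  // npos takes the rest
        begin = s.find_first_not_of(delims, end);        // skip the delimiter run
    }
    return tokens;
}
```

For example, splitOnAny("right way, wrong way, correct way", ", ") produces the same six tokens as the output above.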
Convert a Character String with commas into a vector in R
We can use scan
newvec <- scan(text = text, what ="", sep=",", strip.white = TRUE, quiet = TRUE)
newvec
#[1] "MY" "NAME" "IS" "SLIM" "SHADY"
Or with strsplit
unlist(strsplit(text, ",\\s*"))
Collapse character vector into single string with each string on its own row
There is no way to create an object that stores data in the way that you want, except by having something parse the \n characters. Whitespace in strings is meaningful, i.e. spaces will be preserved when a string is printed. However, how that whitespace appears in your console will vary for a number of reasons. As it is now, your string already contains the characters that tell any function processing it to print each substring on its own line.
text <- c("This should be first row",
"This should be second row",
"This should be third row")
# This is what you did
# Notice that cat() prints to the console and returns NULL, so nothing useful is assigned!
test <- cat(paste(text, collapse = "\n"))
#> This should be first row
#> This should be second row
#> This should be third row
print(test)
#> NULL
# This is what you want: sending the object itself
test2 <- paste(text, collapse = "\n")
print(test2)
#> [1] "This should be first row\nThis should be second row\nThis should be third row"