How to Get the First 10 Words in a String in R

Here is a small function that splits the string into words, keeps the first ten, and pastes them back together.

string_fun <- function(x) {
  ul <- unlist(strsplit(x, split = "\\s+"))[1:10]
  paste(ul, collapse = " ")
}

string_fun(x)
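
One caveat: when a string has fewer than ten words, indexing with [1:10] pads the result with NA, which paste() then turns into the literal text "NA". A minimal sketch of a vectorized variant that avoids this is shown here; the name first_n_words and the argument n are hypothetical, not part of the original answer.

```r
# Hypothetical helper: first n whitespace-separated words of each element of x.
first_n_words <- function(x, n = 10) {
  # trimws() drops leading/trailing whitespace so no empty first "word" appears
  words <- strsplit(trimws(x), "\\s+")
  # head() returns at most n words, so short strings are not padded with NA
  vapply(words, function(w) paste(head(w, n), collapse = " "), character(1))
}

first_n_words("one two three")
# [1] "one two three"
first_n_words(c("a b c d e f g h i j k l", "x y"))
# [1] "a b c d e f g h i j" "x y"
```

Because it works on a whole character vector, it can be applied to a data frame column directly, without apply() or lapply().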

df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)

df <- as.data.frame(df)

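Using apply (each row is passed to the function as a character vector; the second column does not affect the result here because every Keyword already has at least ten words)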

df$Keyword <- apply(df[,1:2], 1, string_fun)

EDIT
Applying the function to the Keyword column only is probably a more general way to use it.

df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))

print(df)
#                                                  Keyword City.Column.Header.
# 1       The length of the string should not be more than            New York
# 2     The Keyword should be of specific length is or are         Los Angeles
# 3 This is an experimental basis program string is or are             Seattle
# 4   Please help me with getting only the first ten words              Boston

R: how to display the first n characters from a string of words

The other answers didn't eliminate the spaces as you did in your example, so I'll add this:

strsplit(substr(gsub("\\s+", "", Getty), 1, 10), '')[[1]]
#[1] "F" "o" "u" "r" "s" "c" "o" "r" "e" "a"
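
The snippet does not define Getty. As a sketch, assuming it holds the opening of the Gettysburg Address (the value below is an assumption chosen to match the output shown), the same idea can be broken into steps:

```r
# Assumed input; the original question's exact string is not shown here.
Getty <- "Four score and seven years ago"

# Step 1: remove all whitespace
no_space <- gsub("\\s+", "", Getty)

# Step 2: keep the first 10 characters as a single string
first_ten <- substr(no_space, 1, 10)
first_ten
# [1] "Fourscorea"

# Step 3 (optional): split into individual characters
strsplit(first_ten, "")[[1]]
```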

Extract the first (or last) n characters of a string

See ?substr

R> substr(a, 1, 4)
[1] "left"
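
The variable a is not defined in the snippet. A short sketch, assuming a <- "leftright" (a value picked to be consistent with the output shown), covering both the first and the last n characters:

```r
a <- "leftright"  # assumed value, consistent with the output "left"

# First 4 characters
first_four <- substr(a, 1, 4)
first_four
# [1] "left"

# Last n characters: start at nchar(a) - n + 1 and stop at the end
n <- 5
last_five <- substr(a, nchar(a) - n + 1, nchar(a))
last_five
# [1] "right"
```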

R get N words from a sentence as a string

A pure stringi solution (stringr::word() is overkill here: stringr is a wrapper around stringi, and word() calls more stringi functions than this does):

library(stringi)

sentence <- "The quick brown fox jumps over the lazy dog"

tail(stri_extract_all_words(sentence)[[1]], 2)
## [1] "lazy" "dog"

stri_join(tail(stri_extract_all_words(sentence)[[1]], 2), collapse=" ")
## [1] "lazy dog"

Actually readable version:

library(magrittr)

stri_extract_all_words(sentence)[[1]] %>%
  tail(2) %>%
  stri_join(collapse=" ")
## [1] "lazy dog"

stri_extract_all_words() also relies on ICU's locale-sensitive word-break rules, which handle punctuation and non-English text better than splitting on a fixed pattern in base R.
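
For comparison, here is a base R sketch with no package dependency. It splits on whitespace only, so punctuation stays attached to the neighbouring word; the variable names are illustrative.

```r
sentence <- "The quick brown fox jumps over the lazy dog"

# Split on runs of whitespace
words <- strsplit(sentence, "\\s+")[[1]]

# Last two words
last_two <- paste(tail(words, 2), collapse = " ")
last_two
# [1] "lazy dog"

# First three words
first_three <- paste(head(words, 3), collapse = " ")
first_three
# [1] "The quick brown"
```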

Using str_extract_all to extract only first two words in R?

Just relying on the stringr package.

library(stringr)

species_location <- c('Homo_sapiens_Lausanne_Switzerland',
                      'Solenopsis_invicta_California_US',
                      'Rattus_novaborensis_Copenhagen_Denmark',
                      'Candida_albicans_Crotch_Home')

word(species_location, 1,2, sep="_")
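
If stringr is not available, roughly the same result can be sketched in base R by splitting on the underscore and rejoining the first two pieces. The name first_two is illustrative.

```r
species_location <- c("Homo_sapiens_Lausanne_Switzerland",
                      "Solenopsis_invicta_California_US",
                      "Rattus_novaborensis_Copenhagen_Denmark",
                      "Candida_albicans_Crotch_Home")

# Split each string on a literal "_" and keep the first two pieces
parts <- strsplit(species_location, "_", fixed = TRUE)
first_two <- vapply(parts,
                    function(p) paste(head(p, 2), collapse = "_"),
                    character(1))
first_two
# [1] "Homo_sapiens"        "Solenopsis_invicta"  "Rattus_novaborensis"
# [4] "Candida_albicans"
```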

Extract the first 2 Characters in a string

You can just use the substr function directly to take the first two characters of each string:

x <- c("75 to 79", "80 to 84", "85 to 89")
substr(x, start = 1, stop = 2)
# [1] "75" "80" "85"

You could also write a simple function to do a "reverse" substring, giving the 'start' and 'stop' values assuming the index begins at the end of the string:

revSubstr <- function(x, start, stop) {
  x <- strsplit(x, "")
  sapply(x,
         function(x) paste(rev(rev(x)[start:stop]), collapse = ""),
         USE.NAMES = FALSE)
}
revSubstr(x, start = 1, stop = 2)
# [1] "79" "84" "89"

Extract first word from a column and insert into new column

You can use a regex ("([A-Za-z]+)", "([[:alpha:]]+)" or "(\\w+)") to grab the first word:

Dataframe1$COL2 <- gsub("([A-Za-z]+).*", "\\1", Dataframe1$COL1)
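
A small worked sketch on made-up data (the contents of Dataframe1 below are hypothetical). Since only one replacement per string is needed, sub() with an anchored pattern is sufficient:

```r
# Hypothetical example data
Dataframe1 <- data.frame(COL1 = c("alpha beta gamma", "delta epsilon"),
                         stringsAsFactors = FALSE)

# Capture the leading run of letters and discard the rest of the string
Dataframe1$COL2 <- sub("^([[:alpha:]]+).*$", "\\1", Dataframe1$COL1)
Dataframe1$COL2
# [1] "alpha" "delta"
```

Note that a string with no leading letters does not match the anchored pattern and is returned unchanged.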

Extract words from a string

Use gsub() with a regular expression

x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
ptn <- "(.*? ){3}"
gsub(ptn, "", x)

[1] "428" "353"

This works because the regular expression (.*? ){3} matches three consecutive runs of characters, each followed by a space, and gsub() replaces the whole match with an empty string.

See ?gsub and ?regexp for more information.


If your data has structure that you don't mention in your question, then possibly the regular expression becomes even easier.

For example, if you are always interested in the last word of each line:

ptn <- "(.*? )"
gsub(ptn, "", x)

Or perhaps you know for sure you can only search for digits and discard everything else:

ptn <- "\\D"
gsub(ptn, "", x)
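
As an alternative sketch (not from the original answer), the first two ideas can also be written with explicit anchors and sub(), which makes the intent a little easier to read:

```r
x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")

# Drop the first three whitespace-delimited fields
after_three <- sub("^(\\S+\\s+){3}", "", x)
after_three
# [1] "428" "353"

# Keep only the last word: greedily remove everything up to the last whitespace
last_word <- sub(".*\\s", "", x)
last_word
# [1] "428" "353"
```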

obtaining first word in the string

A very simple approach with gsub

gsub("/.*", '', y)
[1] "london" "newyork" "paris"
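
The vector y is not defined in the snippet. A sketch assuming slash-separated values (the contents of y below are made up to match the output shown):

```r
# Assumed input; only the part before the first "/" matters
y <- c("london/uk", "newyork/usa", "paris/france")

# Remove the first "/" and everything after it
first_part <- sub("/.*", "", y)
first_part
# [1] "london"  "newyork" "paris"

# Equivalent without a regex: split on a literal "/" and take the first piece
first_part2 <- vapply(strsplit(y, "/", fixed = TRUE), `[`, character(1), 1)
```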

Getting and removing the first character of a string

See ?substring.

x <- 'hello stackoverflow'
substring(x, 1, 1)
## [1] "h"
substring(x, 2)
## [1] "ello stackoverflow"

The idea of having a pop method that both returns a value and has a side effect of updating the data stored in x is very much a concept from object-oriented programming. So rather than defining a pop function to operate on character vectors, we can make a reference class with a pop method.

PopStringFactory <- setRefClass(
  "PopString",
  fields = list(
    x = "character"
  ),
  methods = list(
    initialize = function(x)
    {
      x <<- x
    },
    pop = function(n = 1)
    {
      if(nchar(x) == 0)
      {
        warning("Nothing to pop.")
        return("")
      }
      first <- substring(x, 1, n)
      x <<- substring(x, n + 1)
      first
    }
  )
)

x <- PopStringFactory$new("hello stackoverflow")
x
## Reference class object of class "PopString"
## Field "x":
## [1] "hello stackoverflow"
replicate(nchar(x$x), x$pop())
## [1] "h" "e" "l" "l" "o" " " "s" "t" "a" "c" "k" "o" "v" "e" "r" "f" "l" "o" "w"
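
If mutable state is not wanted, a more functional sketch is to return both pieces and let the caller reassign. The function name pop_string is hypothetical, not part of the original answer.

```r
# Hypothetical helper: returns the first n characters and the remainder,
# without modifying its argument.
pop_string <- function(x, n = 1) {
  list(first = substring(x, 1, n),
       rest  = substring(x, n + 1))
}

res <- pop_string("hello stackoverflow")
res$first
# [1] "h"
res$rest
# [1] "ello stackoverflow"
```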

