Extracting Unique Numbers from String in R

Extracting unique numbers from string in R

For the second result (the unique digits), you can use gsub to remove everything from the string that's not a digit, then split what remains into single characters:

unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))
# [1] 7 6 1 5 2

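For the first result (the unique whole numbers), use strsplit to split on runs of non-digit characters instead; the empty strings this leaves behind become NA and are dropped by na.omit: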

unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))
# [1] 7 667 11 5 2

PS: don't name your variable list, since that shadows the built-in function list(). I've named your data ll.
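
To make the two one-liners above reproducible, here is a minimal sketch. The contents of ll are an assumption (the question's data isn't shown here), chosen so that both calls give the same results as above.

```r
# Hypothetical sample data; the question's actual data is not shown here
ll <- list("a7", "b667", "c11 d5", "e2 f7")

# Unique digits: strip non-digits, then split into single characters
digits <- unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))

# Unique whole numbers: split on runs of non-digits; the empty strings
# produced by leading non-digits become NA and are dropped by na.omit
nums <- unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))

print(digits)
print(nums)
```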

Extracting numbers from vectors of strings

How about

# capture the leading run of digits and drop everything after it
as.numeric(gsub("([0-9]+).*$", "\\1", years))

or

# just remove the literal text " years old"
as.numeric(gsub(" years old", "", years))

or

# split by space, get the element in first index
as.numeric(sapply(strsplit(years, " "), "[[", 1))
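
For a quick side-by-side check, here is a small sketch of the three variants. The years vector is a made-up stand-in, since the original input isn't shown; all three assume each string starts with the number.

```r
# Hypothetical input; the original `years` vector is not shown
years <- c("5 years old", "10 years old", "23 years old")

by_capture <- as.numeric(gsub("([0-9]+).*$", "\\1", years))
by_removal <- as.numeric(gsub(" years old", "", years))
by_split   <- as.numeric(sapply(strsplit(years, " "), "[[", 1))

# All three agree on this input
stopifnot(identical(by_capture, by_removal), identical(by_removal, by_split))
print(by_capture)
```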

How to search for and extract unique values from one column in another column?

I think this works for you:

library(dplyr)

mutate(df, Col_C = stringr::str_extract(
  Col_A,
  paste0("\\b(", paste0(unique(Col_B), collapse = "|"), ")\\b")))
#                Col_A  Col_B  Col_C
# 1   blue shovel 1024   blue   blue
# 2    red shovel 1022    red    red
# 3  green bucket 3021  green  green
# 4    green rake 3021   blue  green
# 5 yellow shovel 1023 yellow yellow

Breakdown:

  • paste0(unique(Col_B), collapse="|") takes the words in Col_B, de-duplicates them, and concatenates them with | symbols; that is, c("blue","red","green") --> "blue|red|green". In regex, the | symbol is an "OR" operator.
  • \\b( and )\\b are word-boundaries, meaning that there isn't a word-like character immediately before (first) or after (second) the patterns; by adding this around the words, we prevent a partial match of blu on blue (in case that ever happens); while it is not apparent that this changes anything here, it's a more defensive/specific pattern. The parens add grouping, more evident in the next bullet.
  • With all of that, our overall pattern looks something like "\\b(blue|red|green)\\b" (abbreviated). This translates into "find blue or red or green such that there is a word-boundary on both ends of whichever one(s) you find".
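
If you'd rather see the pattern at work without dplyr or stringr, here is a rough base R sketch of the same idea. This swaps in regexpr() and regmatches() for str_extract(), and col_a and col_b are made-up vectors, so treat it as an illustration rather than the answer's own code.

```r
# Hypothetical data in the same shape as the example above
col_a <- c("blue shovel 1024", "bluebird feeder 7", "green rake 3021")
col_b <- c("blue", "red", "green", "blue")

# Build "\\b(blue|red|green)\\b" from the de-duplicated words
pattern <- paste0("\\b(", paste0(unique(col_b), collapse = "|"), ")\\b")

# regexpr() gives -1 where nothing matches; fill those rows with NA
m <- regexpr(pattern, col_a, perl = TRUE)
col_c <- rep(NA_character_, length(col_a))
col_c[m > 0] <- regmatches(col_a, m)

print(pattern)
print(col_c)
```

The "bluebird" row gets NA because the word boundary after "blue" fails there, which is the defensive behaviour the second bullet describes.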

Trying to extract/count the unique characters in a string (of class character)

In base R you can do:

df$char_count <- sapply(strsplit(df$Text, ""), function(x) length(unique(x)))

df
#>       Text char_count
#> 1   banana          3
#> 2 banana12          5
#> 3  Ace@343          6

Data

df <- data.frame(Text = c("banana", "banana12", "Ace@343"))

Created on 2021-11-12 by the reprex package (v2.0.0)

Extract unique numbers from a list with multiple items per line using gsub()?

You can use

v <- list(c("12", "1"), c("13", "1"), c("12", "3"))
unique(sapply(v, "[[", 1))
# => [1] "12" "13"


Note:

  • sapply(v, "[[", 1) - gets the first items
  • unique leaves only the unique values.

How to extract numbers from text?

We can use str_extract_all with a pattern that matches one or more digits ([0-9]+). The output is a list of length 1, so extract the vector with [[ and convert it to numeric.

library(stringr)
as.numeric(str_extract_all(string, "[0-9]+")[[1]])
#[1] 2016 81 64 2017 18 36

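If we are using strsplit, split on runs of non-digit characters (\\D+). The [-1] drops the empty first element that strsplit produces when the string starts with a non-digit.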

as.numeric(strsplit(string, "\\D+")[[1]][-1])
#[1] 2016 81 64 2017 18 36
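
Here is a small base R sketch of both ideas, using regmatches() with gregexpr() in place of str_extract_all(). The string is a hypothetical example (the original one isn't shown), chosen to start with a non-digit.

```r
# Hypothetical input that starts with a non-digit
string <- "Jan 2016: 81, 64; Feb 2017: 18, 36"

# Base R counterpart of str_extract_all(): pull out every run of digits
extracted <- as.numeric(regmatches(string, gregexpr("[0-9]+", string))[[1]])

# strsplit() yields a leading "" here, which is why the answer uses [-1]
parts <- strsplit(string, "\\D+")[[1]]

# Dropping empty pieces by content also works when the string starts with a digit
split_nums <- as.numeric(parts[nzchar(parts)])

print(extracted)
print(split_nums)
```

Filtering with nzchar() is a bit safer than [-1], which would throw away a real number if the string happened to begin with a digit.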

Extracting unique partial elements from vector

stringr also has the str_extract function, which can be used to extract substrings that match a regex pattern. With a positive lookbehind for / and a positive lookahead for _, you can achieve your aim.

Beginning with @Andrie's x:

str_extract(x, perl('(?<=/)\\d+(?=_)'))

# [1] NA "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101"

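The pattern above matches one or more digits (i.e. \\d+) that are preceded by a forward slash and followed by an underscore. Wrapping the pattern in perl() was needed for lookarounds in older stringr releases; as far as I know, stringr 1.0.0 and later use the ICU regex engine, accept lookarounds in a plain pattern string, and deprecate perl().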
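
If stringr isn't available, the same lookaround pattern should also work in base R with perl = TRUE. The sketch below uses a made-up x, since @Andrie's vector isn't reproduced here.

```r
# Hypothetical stand-in for x; the original vector is not shown
x <- c("README", "run/4101_a.csv", "run/4101_b.csv")

# Lookbehind for "/" and lookahead for "_" need perl = TRUE in base R
m <- regexpr("(?<=/)\\d+(?=_)", x, perl = TRUE)
ids <- rep(NA_character_, length(x))
ids[m > 0] <- regmatches(x, m)

# Keep only the distinct matches, ignoring elements with no match
print(unique(ids[!is.na(ids)]))
```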


