How to Remove Unicode <U+00A6> from String

How do I remove the Unicode character U+00A6 from a string?

I just want to remove the Unicode character <U+00A6>, which is at the beginning of the string.

Then you do not need gsub; you can use sub with the "^\\s*<U\\+\\w+>\\s*" pattern:

q <-"<U+00A6>  1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)

Pattern details:

  • ^ - start of string
  • \\s* - zero or more whitespaces
  • <U\\+ - a literal char sequence <U+
  • \\w+ - 1 or more letters, digits or underscores
  • > - a literal >
  • \\s* - zero or more whitespaces.

If you also need to replace the - with a space, add a |- alternative and use gsub (since we now expect several replacements, and the replacement must be a space, as in akrun's answer):

trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))
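# expected output: [1] "1000 66329"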

How to remove ¦ (broken bar) from a string?

I think you're not assigning the result, and that's why you're not getting the desired output. Note that replace and replaceAll return a new string; they don't modify the string in place.

The call itself should work. But if you have problems, keep in mind that you can also use the returned value directly:

String str = "sdfsdf¦sdfsdf";
System.out.println(str.replaceAll("¦", ""));
// Output: sdfsdfsdfsdf
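// Or assign the returned value back if you want str itself updated:
str = str.replaceAll("¦", "");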

Also, there's no need for replaceAll here; you can use replace instead, which doesn't take a regex.

R - gsub to remove punctuation & numbers from string

Second question first: typeof() will always return list for a data frame, because data frames are really just lists of equal-length vectors.

For the first question, it appears you have some Unicode-encoded characters in your data. One good way to take care of these is to convert them, perhaps like this:

df$city <- iconv(df$city, 'utf-8', 'ascii', sub = '')

It is also possible to gsub out characters by their hex code, like this:

df$city <- gsub('\u200B', '', df$city)

or even a range:

df$city <- gsub('[\u2000-\u20ff]', '', df$city)

But really I think the iconv approach is the way to go. In this usage it will just remove the character rather than render it, but that seems to be what you want.
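
For illustration, here is a minimal sketch of the difference between the two approaches, using a made-up sample value (the accented character and the zero-width space are assumptions, not taken from your data):

x <- "Montr\u00e9al\u200bCity"          # hypothetical value: an é plus a zero-width space (U+200B)

iconv(x, 'utf-8', 'ascii', sub = '')    # drops every non-ASCII character -> "MontralCity"
gsub('[\u2000-\u20ff]', '', x)          # removes only the targeted range -> "MontréalCity"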

PHP function to convert unicode to special characters?

Try mb_convert_encoding() with the "to" encoding as 'HTML-ENTITIES', and (if necessary) the "from" encoding set to 'UTF-8' or whichever Unicode encoding you're using.

Possible to do a string replace with a dictionary?

preferred method using third-party module

A much better alternative to the method below is to use the awesome unidecode module:

>>> import unidecode
>>> somestring = u"äüÊÂ"
>>> unidecode.unidecode(somestring)
'auEA'

built-in, slightly hazardous method

Inferring from your question that you are looking to normalize Unicode characters, there is actually a nice, built-in way to do this:

>>> somestring = u"äüÊÂ"
>>> somestring
u'\xe4\xfc\xca\xc2'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', somestring).encode('ascii', 'ignore')
'auEA'

Check out the documentation for unicodedata.normalize.

Note, however, that there might be some issues with this. See this post for a nice explanation and some workarounds.

See also latin-1-to-ascii for alternatives.

U+00A0 special characters when reading a csv file

This should work for you:

library(dplyr)

df %>% 
  mutate(clean_gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene))

Note the clean_gene column:

  gene               clean_gene
  <chr>              <chr>
1 IL-12A/IL-12B      IL-12A/IL-12B
2 IL18R1 and IL18RAP IL18R1 and IL18RAP
3 <U+00A0>KLRK1      KLRK1
4 IFNG               IFNG
5 NA                 NA
6 <U+00A0>KLRK1      KLRK1
7 <U+00A0>KLRK1      KLRK1
Edit:

To apply to a list of data.frames:

library(purrr)
library(dplyr)

list_of_dfs <- list_of_dfs %>%
  map(~mutate(., gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene)))
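
As a quick check, here is a minimal sketch with a made-up two-element list (the sample values and the cleaned name are hypothetical; only the gene column name comes from the question):

library(purrr)
library(dplyr)

# toy list: two data frames, each with a stray <U+00A0> tag in gene
list_of_dfs <- list(
  data.frame(gene = c("<U+00A0>KLRK1", "IFNG")),
  data.frame(gene = c("IL-12A/IL-12B", "<U+00A0>KLRK1"))
)

cleaned <- list_of_dfs %>%
  map(~mutate(., gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene)))

cleaned[[1]]$gene
# [1] "KLRK1" "IFNG"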

