How to remove unicode U+00A6 from string?
I just want to remove the unicode <U+00A6> which is at the beginning of the string.
Then you do not need gsub; you can use sub with the "^\\s*<U\\+\\w+>\\s*" pattern:
q <- "<U+00A6> 1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)
Pattern details:
^ - start of string
\\s* - zero or more whitespaces
<U\\+ - a literal char sequence <U+
\\w+ - 1 or more letters, digits or underscores
> - a literal >
\\s* - zero or more whitespaces.
If you also need to replace the - with a space, add the |- alternative and use gsub (since now we expect several replacements and the replacement must be a space, the same as in akrun's answer):
trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))
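For comparison, here is a minimal Python sketch of the same two steps, assuming the marker is literal text in the string, as in q above. The function names are hypothetical. re.sub with count=1 plays the role of sub, and re.sub without a count plays the role of gsub.

```python
import re

# Leading "<U+XXXX>" marker with optional surrounding whitespace.
LEADING_MARKER = re.compile(r"^\s*<U\+\w+>\s*")
# Leading marker, or any hyphen anywhere in the string.
MARKER_OR_HYPHEN = re.compile(r"^\s*<U\+\w+>|-")

def strip_leading_marker(text):
    # One replacement at most, like R's sub().
    return LEADING_MARKER.sub("", text, count=1)

def marker_and_hyphens_to_space(text):
    # Replace every match with a space, then trim, like trimws(gsub(...)).
    return MARKER_OR_HYPHEN.sub(" ", text).strip()

q = "<U+00A6> 1000-66329"
print(strip_leading_marker(q))         # 1000-66329
print(marker_and_hyphens_to_space(q))  # 1000 66329
```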
How to replace ¦ (broken bar) from a string?
I think you're not assigning the result, and that's why you're not getting the desired output. Note that replace and replaceAll return a new string; they don't modify the string in place.
It should actually work. But if you have problems, keep in mind that you can use it directly:
String str = "sdfsdf¦sdfsdf";
System.out.println(str.replaceAll("¦", ""));
// Output: sdfsdfsdfsdf
Also, there's no need for replaceAll; you can use replace instead (it doesn't accept a regex).
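The same pitfall can be sketched in Python, whose strings are also immutable: str.replace does a literal (non-regex) replacement and returns a new string. This is only an illustration of the "assign the result" point, not the Java code from the question; the helper name is hypothetical.

```python
BROKEN_BAR = "\u00a6"  # the broken bar character

def remove_broken_bar(text):
    # Literal replacement; the original string is never modified.
    return text.replace(BROKEN_BAR, "")

s = "sdfsdf\u00a6sdfsdf"
s.replace(BROKEN_BAR, "")       # result discarded: s is unchanged
cleaned = remove_broken_bar(s)  # result assigned: this is the fix
print(s)
print(cleaned)
```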
R - gsub to remove punctuation & numbers from string
Second question first: typeof() will always return "list" for a data frame, because data frames are really just lists of equal-length vectors.
For the first question, it appears you have some Unicode encoded characters in your data. One good way to take care of these is to convert them, perhaps like:
df$city <- iconv(df$city, 'utf-8', 'ascii', sub = '')
It is also possible to gsub out characters by their hex code, like this:
df$city <- gsub('\u200B', '', df$city)
or even a range:
df$city <- gsub('[\u2000-\u20ff]', '', df$city)
But really I think the iconv approach is the way to go. In this usage it will just remove the character rather than render it, but that seems to be what you want.
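A rough Python analogue of both approaches, as a sketch only: encoding to ASCII with errors="ignore" drops every non-ASCII character (similar in effect to iconv with sub = ''), while a character class removes just one code-point range. The function names are hypothetical.

```python
import re

# Same range as the gsub example: U+2000 through U+20FF.
PUNCTUATION_RANGE = re.compile("[\u2000-\u20ff]")

def drop_non_ascii(text):
    # Every character outside ASCII is silently removed.
    return text.encode("ascii", "ignore").decode("ascii")

def drop_range(text):
    # Only characters in U+2000..U+20FF are removed; other non-ASCII text is kept.
    return PUNCTUATION_RANGE.sub("", text)

print(drop_non_ascii("Zu\u200brich"))    # Zurich
print(drop_range("caf\u00e9\u200b"))     # café
```

The trade-off is that drop_non_ascii also removes accented letters, so the range-based version is the safer choice when only invisible or punctuation characters should go.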
PHP function to convert unicode to special characters?
Try mb_convert_encoding() with the "to" encoding as 'HTML-ENTITIES', and (if necessary) the "from" encoding set to 'UTF-8' or whichever Unicode encoding you're using.
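To illustrate what such a conversion produces, here is a Python sketch (not the PHP call itself). It uses the standard html.entities.codepoint2name table for named entities and falls back to numeric references. The function name is hypothetical, and leaving ASCII characters untouched is a choice made in this sketch.

```python
from html.entities import codepoint2name

def to_html_entities(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII is passed through unchanged in this sketch.
            out.append(ch)
        elif cp in codepoint2name:
            # Named entity, e.g. U+00A6 -> &brvbar;
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No named entity available: numeric character reference.
            out.append(f"&#{cp};")
    return "".join(out)

print(to_html_entities("a\u00a6b"))  # a&brvbar;b
print(to_html_entities("\u4e2d"))    # &#20013;
```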
Possible to do a string replace with a dictionary?
preferred method using third-party module
A much better alternative than the method below is to use the awesome unidecode module:
>>> import unidecode
>>> somestring = u"äüÊÂ"
>>> unidecode.unidecode(somestring)
'auEA'
built-in, slightly-hazardous method
Inferring from your question that you are looking to normalize unicode characters, there is actually a nice, built-in way to do this:
>>> somestring = u"äüÊÂ"
>>> somestring
u'\xe4\xfc\xca\xc2'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', somestring).encode('ascii', 'ignore')
'auEA'
Check out the documentation for unicodedata.normalize.
Note, however, that there might be some issues with this. See this post for a nice explanation and some workarounds.
See also, latin-1-to-ascii for alternatives.
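The session above is Python 2. A Python 3 sketch of the built-in method follows, assuming the goal is a plain str result: encode() returns bytes in Python 3, so the result is decoded again. The function name is hypothetical. The second call shows the hazard mentioned above: characters with no decomposition, such as ø and ß, are dropped entirely.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits accented letters into a base letter plus combining marks;
    # encoding to ASCII with "ignore" then discards everything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("\u00e4\u00fc\u00ca\u00c2"))  # auEA
print(strip_accents("s\u00f8\u00df"))             # s
```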
U+00A0 special characters when reading a csv file
This should work for you:
df %>%
  mutate(clean_gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene))
Note the clean_gene column:
  gene               clean_gene
  <chr>              <chr>
1 IL-12A/IL-12B      IL-12A/IL-12B
2 IL18R1 and IL18RAP IL18R1 and IL18RAP
3 <U+00A0>KLRK1      KLRK1
4 IFNG               IFNG
5 NA                 NA
6 <U+00A0>KLRK1      KLRK1
7 <U+00A0>KLRK1      KLRK1
Edit:
To apply to a list of data.frames:
library(purrr)
library(dplyr)
list_of_dfs <- list_of_dfs %>%
  map(~mutate(., gene = gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", gene)))
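The same idea can be sketched in plain Python, under loud assumptions: each table is a list of row dicts with a "gene" key (a hypothetical layout standing in for a data frame), and the pattern is a narrower one than the gsub above, matching only literal "<U+XXXX>" markers rather than any tag-like text.

```python
import re

# Matches escaped code-point markers such as "<U+00A0>" (4 to 6 hex digits).
MARKER = re.compile(r"<U\+[0-9A-Fa-f]{4,6}>")

def clean_gene(value):
    # Missing values (None here, NA in R) are passed through untouched.
    if value is None:
        return None
    return MARKER.sub("", value)

def clean_tables(tables):
    # Returns new tables; the input rows are not modified.
    return [
        [{**row, "gene": clean_gene(row["gene"])} for row in table]
        for table in tables
    ]

tables = [[{"gene": "<U+00A0>KLRK1"}, {"gene": "IFNG"}, {"gene": None}]]
print(clean_tables(tables))
```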