Replace Unicode escape sequences in a string
You could use a regular expression to parse the file:
private static readonly Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);
public string Decoder(string value)
{
return _regex.Replace(
value,
m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
);
}
And then:
string data = Decoder(File.ReadAllText("test.txt"));
How to replace unicode characters in string with something else python?
1. Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
2. Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "*")
3. Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
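For comparison, here is a minimal Python 3 sketch of the same three steps. It assumes the input arrives as UTF-8 bytes; the name replace_bullets is made up for illustration.

```python
def replace_bullets(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, swap U+2022 (bullet) for '*', and encode again."""
    text = raw.decode("utf-8")           # bytes -> str
    text = text.replace("\u2022", "*")   # str literals are already Unicode
    return text.encode("utf-8")          # str -> bytes, only needed for I/O


print(replace_bullets("\u2022 first item".encode("utf-8")))  # b'* first item'
```

In Python 3 the decode and encode calls normally live at the I/O boundary (for example, open(path, encoding="utf-8")), so in between you only ever call replace on a str.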
How to replace unicode escape character in Dart
Unicode escapes and other escape sequences aren't stored the way you type them in the source: the compiler converts them to the characters they stand for. This is evident when you run the following code:
print('\\u2013'.length); // Prints: 6
print('\u2013'.length); // Prints: 1
Here, the first string stores six characters: '\', 'u', '2', '0', '1', and '3', while the second stores only '–'.
Hence, your attempt to fix the first string by replacing the two backslashes \\ with a single backslash \ wouldn't work, because by the time your code runs the compiler is no longer interpreting escape sequences.
That doesn't mean that you won't be able to convert your unicode codes into unicode characters though. You could use the following code:
final String str = 'Jeremiah 52:1\\u2013340';
final Pattern unicodePattern = new RegExp(r'\\u([0-9A-Fa-f]{4})');
final String newStr = str.replaceAllMapped(unicodePattern, (Match unicodeMatch) {
final int hexCode = int.parse(unicodeMatch.group(1)!, radix: 16);
final unicode = String.fromCharCode(hexCode);
return unicode;
});
print('Old string: $str');
print('New string: $newStr');
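The same regular-expression approach can be sketched in Python. This is only an illustration under the same assumption as above (every escape is a backslash, a 'u', and exactly four hex digits); the name unescape_unicode is hypothetical.

```python
import re

# A literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_unicode(text: str) -> str:
    """Replace each literal backslash-u escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


result = unescape_unicode("Jeremiah 52:1\\u2013340")
print(result == "Jeremiah 52:1\u2013340")  # True
```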
Replace Unicode escapes with the corresponding character
Joao's answer is probably the simplest, but this function can help when you don't want to have to download the apache jar, whether for space reasons, portability reasons, or you just don't want to mess with licenses or other Apache cruft. Also, since it doesn't have very much functionality, I think it should be faster. Here it is:
public static String unescapeUnicode(String s) {
StringBuilder sb = new StringBuilder();
int oldIndex = 0;
for (int i = 0; i + 6 <= s.length(); i++) { // stop where a full escape (backslash, 'u', four hex digits) no longer fits
if (s.substring(i, i + 2).equals("\\u")) {
sb.append(s.substring(oldIndex, i));
int codePoint = Integer.parseInt(s.substring(i + 2, i + 6), 16);
sb.append(Character.toChars(codePoint));
i += 5;
oldIndex = i + 1;
}
}
sb.append(s.substring(oldIndex, s.length()));
return sb.toString();
}
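A hand-rolled scanner like this has to decide what to do with truncated or malformed escapes. Below is a rough Python sketch of the same idea that leaves anything that is not a complete escape untouched; the name unescape_scan and that "leave it alone" policy are my assumptions, not part of the answer above.

```python
import string


def unescape_scan(s: str) -> str:
    """Scan left to right, converting only complete backslash-u escapes."""
    out = []
    i = 0
    while i < len(s):
        digits = s[i + 2:i + 6]  # slicing never raises, it just comes up short
        if (s.startswith("\\u", i) and len(digits) == 4
                and all(c in string.hexdigits for c in digits)):
            out.append(chr(int(digits, 16)))
            i += 6  # skip backslash, 'u', and four hex digits
        else:
            out.append(s[i])  # ordinary character or malformed escape
            i += 1
    return "".join(out)


print(unescape_scan("a\\u0041b"))  # aAb
print(unescape_scan("x\\u12"))     # x\u12 (too short, left as-is)
```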
I hope this helps! (You don't have to give me credit for this, I give it to public domain)
How to replace escaped unicode characters with proper unicode characters?
Your regex extracts JSON strings from a webpage:
searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)
Those " characters you removed were actually significant. The \uxxxx escape syntax here is specific to JSON (and JavaScript); it is closely related to Python's own escape syntax but differs slightly, which matters when you have non-BMP codepoints.
You can trivially decode them as JSON if you keep the quotes in there:
searched_results = map(json.loads, re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source))
Better still would be to use an HTML library to parse the page. When using BeautifulSoup, you can get the data with:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]
This loads the text contents of each <div class="rg_meta" ...> element as JSON data, and extracts the ou key from each of the resulting dictionaries. No regular expressions required.
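To see why keeping the quotes matters, here is a small self-contained sketch. The sample string is invented for illustration; it contains one BMP escape and one surrogate pair for a non-BMP codepoint.

```python
import json

# A JSON string literal, including its surrounding double quotes.
quoted = '"caf\\u00e9 \\ud83d\\ude00"'

decoded = json.loads(quoted)

# json.loads joins the surrogate pair into the single codepoint U+1F600.
print(decoded == "caf\u00e9 \U0001F600")  # True
print(len(decoded))                        # 6
```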
Unicode characters replace from string using C#
Use regexp:
var unicodeRegexp = new Regex(@"\x1f");
var testWord = "our guests will experience \u001favor in an area";
var newWord = unicodeRegexp.Replace(testWord, "text for replacement");
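In the regex pattern, \x1f stands for the same character as \u001f in the string literal. The \x escape takes exactly two hex digits, so the leading zeros are dropped.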
https://www.regular-expressions.info/unicode.html#codepoint
How do I convert unicode escape sequences to unicode characters in a python string
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling that it is unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the unicode into, e.g., a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
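The session above is Python 2. A rough Python 3 equivalent, assuming the name arrives as Latin-1 encoded bytes (the variable names are mine):

```python
raw = b"Christensen Sk\xf6ld"   # Latin-1 encoded bytes

name = raw.decode("latin-1")    # bytes -> str
utf8 = name.encode("utf-8")     # str -> UTF-8 bytes

print(name == "Christensen Sk\u00f6ld")  # True
print(utf8)                               # b'Christensen Sk\xc3\xb6ld'
```

In Python 3 every str is already Unicode, so there is no u prefix to look for and no unicode() built-in; only the bytes-to-str boundary needs an explicit encoding.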
Replace Unicode Characters in a String
The source code says (https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html),
public static String stripAccents(final String input) {
if (input == null) {
return null;
}
final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
convertRemainingAccentCharacters(decomposed);
// Note that this doesn't correctly remove ligatures...
return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
}
It has a comment that says:
// Note that this doesn't correctly remove ligatures...
So maybe you need to replace those instances manually.
Something like,
String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
string = string.replaceAll("\\p{M}", "");
string = string.replace("ß", "s");
string = string.replace("ø", "o");
string = string.replace("œ", "o");
string = string.replace("æ", "a");
Diacritical Character to ASCII Character Mapping
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html
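The same normalize-then-patch idea can be sketched in Python with the standard unicodedata module. The MANUAL table simply mirrors the four replacements in the Java snippet above; both it and the name strip_accents are illustrative, not a complete mapping.

```python
import unicodedata

# Letters that have no Unicode decomposition, so normalization leaves them
# alone. These four mappings mirror the Java snippet and are not exhaustive.
MANUAL = {"ß": "s", "ø": "o", "œ": "o", "æ": "a"}


def strip_accents(text: str) -> str:
    """Decompose, drop combining marks, then patch the leftovers by hand."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        c for c in decomposed
        if not unicodedata.category(c).startswith("M")  # Mn, Mc, Me
    )
    return "".join(MANUAL.get(c, c) for c in without_marks)


print(strip_accents("fůňķŷ Šťŕĭńġ"))  # funky String
```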