Replace Unicode Escape Sequences in a String

Replace Unicode escape sequences in a string

You could use a regular expression to parse the file:

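// Requires: using System.Globalization; using System.Text.RegularExpressions;
// Match only hex digits, so int.Parse cannot fail on input such as \uZZZZ.
private static readonly Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);

public string Decoder(string value)
{
    // Replace each \uXXXX escape with the UTF-16 code unit it encodes.
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}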

And then:

string data = Decoder(File.ReadAllText("test.txt"));
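
For comparison, the same regex approach can be sketched in Python 3. This is only an illustration under assumptions: the name decode_escapes and the sample string are made up here, and escapes that encode a surrogate pair are not combined into a single code point.

```python
import re

# Backslash, 'u', then exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_escapes(value):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)


# Hypothetical sample input containing a literal backslash-u sequence.
sample = "caf\\u00e9"
print(decode_escapes(sample) == "caf\u00e9")
```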

How to replace Unicode characters in a string with something else in Python?

  1. Decode the string to Unicode. Assuming it's UTF-8-encoded:

    str.decode("utf-8")
  2. Call the replace method and be sure to pass it a Unicode string as its first argument:

    str.decode("utf-8").replace(u"\u2022", "*")
  3. Encode back to UTF-8, if needed:

    str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")

(Fortunately, Python 3 puts a stop to this mess. Step 3 should really only be performed just prior to I/O. Also, mind you that calling a string str shadows the built-in type str.)
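
In Python 3 the same three steps might look like the sketch below. It assumes the data arrives as UTF-8 bytes; the variable names and the sample text are invented for illustration.

```python
# Bytes as they might arrive from a file or socket (hypothetical sample data).
raw = "\u2022 first item".encode("utf-8")

text = raw.decode("utf-8")             # step 1: bytes -> str
cleaned = text.replace("\u2022", "*")  # step 2: str literals are already Unicode
encoded = cleaned.encode("utf-8")      # step 3: only just before I/O

print(cleaned)
```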

How to replace unicode escape character in Dart

Unicode characters and escape sequences aren't stored the way you type them in a string literal; the compiler converts them to the values they represent. You can see this by running the following code:

print('\\u2013'.length); // Prints: 6
print('\u2013'.length); // Prints: 1

Here, the first string stores six characters: '\', 'u', '2', '0', '1', and '3', while the second stores only '–'.

Hence, your attempt to fix the first string by replacing the two backslashes \\ with a single backslash \ won't work: the compiler only converts an escape sequence when it appears in a string literal in the source, not when it appears in a string's contents at run time.

That doesn't mean that you won't be able to convert your unicode codes into unicode characters though. You could use the following code:

final String str = 'Jeremiah  52:1\\u2013340';
final Pattern unicodePattern = RegExp(r'\\u([0-9A-Fa-f]{4})');
final String newStr = str.replaceAllMapped(unicodePattern, (Match unicodeMatch) {
  // group(1) is nullable under null safety; the pattern guarantees it is set.
  final int hexCode = int.parse(unicodeMatch.group(1)!, radix: 16);
  final unicode = String.fromCharCode(hexCode);
  return unicode;
});
print('Old string: $str');
print('New string: $newStr');

Replace Unicode escapes with the corresponding character

Joao's answer is probably the simplest, but this function can help when you don't want to have to download the apache jar, whether for space reasons, portability reasons, or you just don't want to mess with licenses or other Apache cruft. Also, since it doesn't have very much functionality, I think it should be faster. Here it is:

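public static String unescapeUnicode(String s) {
    StringBuilder sb = new StringBuilder();

    int oldIndex = 0;

    // Stop once fewer than six characters ("\\uXXXX") remain, so substring() cannot overrun.
    for (int i = 0; i + 6 <= s.length(); i++) {
        if (s.startsWith("\\u", i)) {
            // Copy the text before the escape, then the decoded character.
            sb.append(s, oldIndex, i);
            // Throws NumberFormatException if the four characters are not hex digits.
            int codePoint = Integer.parseInt(s.substring(i + 2, i + 6), 16);
            sb.append(Character.toChars(codePoint));

            // Skip the rest of the escape; the loop's i++ moves past its last digit.
            i += 5;
            oldIndex = i + 1;
        }
    }

    sb.append(s.substring(oldIndex));

    return sb.toString();
}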

I hope this helps! (You don't have to give me credit for this, I give it to public domain)

How to replace escaped unicode characters with proper unicode characters?

Your regex extracts JSON strings from a webpage:

searched_results = re.findall(r"(?<=,\"ou\":\")[^\s]+[\w](?=\",\"ow\")", results_source)

Those " characters you removed were actually significant. The \uxxxx escape syntax here is specific to JSON (and JavaScript); it is closely related to Python's escape syntax but not identical, and the difference matters when you have non-BMP code points.

You can trivially decode them as JSON if you keep the quotes in there:

searched_results = map(json.loads, re.findall(r"(?<=,\"ou\":)\"[^\s]+[\w]\"(?=,\"ow\")", results_source))
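
To illustrate why the JSON route matters for non-BMP code points, here is a small Python 3 sketch. The sample escape (a surrogate pair for U+1F600) is chosen purely for demonstration and does not come from the page being scraped.

```python
import json

# A JSON string literal holding a surrogate pair, quotes included.
escaped = '"\\ud83d\\ude00"'

# json.loads combines the pair into one code point.
decoded = json.loads(escaped)
print(len(decoded))

# Decoding the same escapes with Python's own escape codec
# leaves two separate surrogate code points instead.
naive = "\\ud83d\\ude00".encode("ascii").decode("unicode_escape")
print(len(naive))
```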

Better still would be to use an HTML library to parse the page. When using BeautifulSoup, you can get the data with:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(results_source, 'html.parser')
search_results = [json.loads(t.text)['ou'] for t in soup.select('.rg_meta')]

This loads the text contents of each <div class="rg_meta" ...> element as JSON data, and extracts the ou key from each of the resulting dictionaries. No regular expressions required.

Unicode characters replace from string using C#

Use regexp:

var unicodeRegexp = new Regex(@"\x1f");
var testWord = "our guests will experience \u001favor in an area";
var newWord = unicodeRegexp.Replace(testWord, "text for replacement");

\x1f is the regex equivalent of \u001f: \x takes exactly two hex digits, so the leading zeros are dropped.
https://www.regular-expressions.info/unicode.html#codepoint
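
The same replacement can be sketched in Python 3, whose re module also accepts \x1f in a pattern. The variable names mirror the C# snippet and are otherwise arbitrary.

```python
import re

# U+001F (the unit separator control character), written as a two-digit hex escape.
unicode_regexp = re.compile(r"\x1f")

test_word = "our guests will experience \u001favor in an area"
new_word = unicode_regexp.sub("text for replacement", test_word)

print(new_word)
```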

How do I convert Unicode escape sequences to Unicode characters in a Python string

Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:

>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'

Another way of achieving this:

>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'

Note the "u" in front of the string, signalling that it is Unicode. If you print this, the accented letter is shown properly:

>>> print name.decode('latin-1')
Christensen Sköld

BTW: when necessary, you can use the "encode" method to turn the Unicode string into e.g. a UTF-8 string:

>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
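
The session above is Python 2. In Python 3, where bytes and str are separate types, a rough equivalent might look like this, assuming the name arrives as Latin-1 encoded bytes:

```python
# The name as raw Latin-1 bytes (assumed input form).
raw_name = b"Christensen Sk\xf6ld"

name = raw_name.decode("latin-1")  # bytes -> str
utf8_bytes = name.encode("utf-8")  # str -> UTF-8 bytes

print(utf8_bytes)
```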

Replace Unicode Characters in a String

The source code says (https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html),

public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    convertRemainingAccentCharacters(decomposed);
    // Note that this doesn't correctly remove ligatures...
    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
}

It has a comment that says,
// Note that this doesn't correctly remove ligatures...

So maybe you need to replace those instances manually, with something like:

String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
string = string.replaceAll("\\p{M}", "");

string = string.replace("ß", "s");
string = string.replace("ø", "o");
string = string.replace("œ", "o");
string = string.replace("æ", "a");
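
The same normalize, strip and replace sequence can be sketched in Python 3 with the standard unicodedata module. The function name strip_accents and the MANUAL table are illustrative; the table simply copies the four replacements shown above.

```python
import unicodedata

# Characters with no decomposition, mapped by hand as in the Java snippet.
MANUAL = {"ß": "s", "ø": "o", "œ": "o", "æ": "a"}


def strip_accents(text):
    """Decompose, drop combining marks, then apply the manual replacements."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return "".join(MANUAL.get(ch, ch) for ch in without_marks)


print(strip_accents("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ"))
```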

Diacritical Character to ASCII Character Mapping
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html


