How can I convert surrogate pairs to normal string in Python?
You've mixed a literal string \ud83d in a json file on disk (six characters: \ u d 8 3 d) and a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.
If you see a '\ud83d\ude4f' Python string (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you get one and you can't fix the upstream code that generates it, you could fix it using the surrogatepass error handler:
>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'🙏'
Python 2 was more permissive.
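If the workaround is needed in more than one place, it can be wrapped in a small helper. The sketch below is only one way to package it: the name merge_surrogates is hypothetical (it is not part of any library), and it assumes Python 3 and that every surrogate in the input belongs to a well-formed pair. A lone surrogate would make the decode step raise UnicodeDecodeError.

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into single code points.

    Hypothetical helper: assumes all surrogates in `text` are correctly paired.
    """
    # 'surrogatepass' lets the encoder write the surrogate code units as-is;
    # decoding the resulting UTF-16 bytes then joins each pair.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


fixed = merge_surrogates('\ud83d\ude4f')
print(ascii(fixed))  # '\U0001f64f'
print(len(fixed))    # 1
```

Strings that contain no surrogates pass through unchanged, so the helper should be safe to apply to mixed input.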
Note: even if your json file contains a literal \ud83d\ude4f (12 characters), you shouldn't get the surrogate pair:
>>> import json
>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'
Notice: the result is 1 character ('\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').
Python: Find equivalent surrogate pair from non-BMP unicode char
You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:
import re

_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)
Demo:
>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
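The same pair can also be computed arithmetically instead of going through an encoder. The following is a minimal sketch of the standard UTF-16 formula, not the code from the answer above; the function name surrogate_pair is an assumption made for illustration.

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character."""
    cp = ord(char)
    if cp <= 0xFFFF:
        raise ValueError('character is already in the BMP')
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits select the low surrogate
    return chr(high) + chr(low)


pair = surrogate_pair('\U0001f64f')
print(ascii(pair))  # '\ud83d\ude4f'
```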
PYTHON RE Dont split UNICODE Chars into surrogate pairs while matching
Use a regular expression that matches a surrogate pair or anything. This will work in wide and narrow builds of Python 2, but isn't needed in a wide build since it doesn't use surrogate pairs.
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print re.findall(ur'[\ud800-\udbff][\udc00-\udfff]|.', te, re.UNICODE)
[u'A', u'\u5200', u'\U0001f600', u'\U0001f601', u'\u5100', u'Z']
This will still work in the latest Python 3, but also isn't needed because surrogate pairs are no longer used in Unicode strings (no wide or narrow build anymore):
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print(re.findall(r'[\ud800-\udbff][\udc00-\udfff]|.', te))
['A', '刀', '😀', '😁', '儀', 'Z']
Works without the surrogate match:
>>> print(re.findall(r'.', te))
['A', '刀', '😀', '😁', '儀', 'Z']
And then you can just iterate normally in Python 3:
>>> for c in te:
...     print(c)
...
A
刀
😀
😁
儀
Z
Note there is still an issue with graphemes (Unicode code point combinations that represent a single character). Here's a bad case:
>>> s = ''
>>> for c in s:
...     print(c)
...
The third-party regex module can match graphemes:
>>> import regex
>>> s = ''
>>> for c in regex.findall(r'\X', s):
...     print(c)
...
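To see why per-code-point iteration splits such a character, it helps to build one from explicit escapes. The example below is illustrative and is not the string used in the demo above: it assumes a family emoji made of three people joined by ZERO WIDTH JOINER (U+200D), and it uses only the standard library.

```python
import unicodedata

# Assumed example: man + ZWJ + woman + ZWJ + girl, which renders as one
# "family" glyph on platforms that support the sequence.
s = '\U0001F468\u200D\U0001F469\u200D\U0001F467'

print(len(s))  # 5 code points, although it displays as one character
for c in s:
    # Show each code point with its official Unicode name.
    print('U+%04X' % ord(c), unicodedata.name(c))
```

Iterating over the string yields the people and the joiners separately, which is why a grapheme-aware matcher such as \X in the regex module is needed to keep them together.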
Python: getting correct string length when it contains surrogate pairs
I think this has been fixed in Python 3.3. See:
http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
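As a quick check, assuming Python 3.3 or later, len() counts code points, so a non-BMP character contributes 1. If the UTF-16 length is what you actually need (for example, to match a JavaScript string's length), one way to get it is to count the encoded code units, as sketched here:

```python
s = 'A\U0001f600'

# Python 3.3+ (PEP 393): len() counts code points.
print(len(s))  # 2

# Number of UTF-16 code units: each unit is 2 bytes in UTF-16-LE.
utf16_units = len(s.encode('utf-16-le')) // 2
print(utf16_units)  # 3
```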
c++: how to remove surrogate unicode values from string?
First off, you can't store UTF-16 surrogates in a std::string (char-based), you would need std::u16string (char16_t-based), or std::wstring (wchar_t-based) on Windows only. Javascript strings are UTF-16 strings.
For those string types, you can use either:

std::remove_if() + std::basic_string::erase():

#include <string>
#include <algorithm>

std::u16string str; // or std::wstring on Windows
...
str.erase(
    std::remove_if(str.begin(), str.end(),
        [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
    ),
    str.end()
);

std::erase_if() (C++20 and later only):

#include <string>

std::u16string str; // or std::wstring on Windows
...
std::erase_if(str,
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
);
UPDATE: You edited your question to change its semantics. Originally, you asked how to remove surrogates, now you are asking how to replace them instead. You can use std::replace_if() for that task, eg:
#include <string>
#include <algorithm>

std::u16string str; // or std::wstring on Windows
...
std::replace_if(str.begin(), str.end(),
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
    u'_'
);
Or, if you really want a regex-based approach, you can use std::regex_replace(), eg:
#include <string>
#include <regex>

std::wstring str; // std::basic_regex does not support char16_t strings!
...
std::wstring newstr = std::regex_replace(
    str,
    std::wregex(L"[\\uD800-\\uDFFF]"),
    L"_"
);
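For readers following the Python answers above, roughly the same clean-up can be sketched in Python 3, where surrogates left in a str can be matched directly by code point range. This is an illustrative analogue rather than part of the C++ answer, and the name strip_surrogates is hypothetical.

```python
import re

# Matches any UTF-16 surrogate code point left in a Python 3 str.
_surrogates = re.compile(r'[\ud800-\udfff]')

def strip_surrogates(text, replacement=''):
    """Remove surrogate code points, or replace each one with `replacement`."""
    return _surrogates.sub(replacement, text)


print(ascii(strip_surrogates('a\ud83db')))       # 'ab'
print(ascii(strip_surrogates('a\ud83db', '_')))  # 'a_b'
```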