How to Convert Surrogate Pairs to Normal String in Python

How can I convert surrogate pairs to normal string in Python?

You've confused a literal \ud83d escape in a JSON file on disk (six characters: \ u d 8 3 d) with the single character u'\ud83d' in memory (written as a string literal in Python source code). It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3.
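A quick check makes the distinction concrete. This is only a minimal sketch; the variable names are illustrative and not part of the original question.

```python
# Escaped text as it would sit in a JSON file: backslash, 'u', four hex digits.
escaped = r'\ud83d'
# A single (lone surrogate) code point held in a Python str.
in_memory = '\ud83d'

print(len(escaped))         # 6
print(len(in_memory))       # 1
print(hex(ord(in_memory)))  # 0xd83d
```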

If you see a '\ud83d\ude4f' Python string (2 characters), there is a bug upstream. Normally you shouldn't get such a string. If you get one and can't fix the upstream code that generates it, you can repair it using the surrogatepass error handler:

>>> "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
'🙏'
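The same round trip can be wrapped in a small helper. This is a sketch under the assumption that every surrogate in the input is properly paired; the function name merge_surrogates is made up for illustration.

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into the code points they encode."""
    # 'surrogatepass' lets the encoder emit the surrogate code units as-is;
    # decoding the resulting UTF-16 bytes then pairs them up again.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


fixed = merge_surrogates('\ud83d\ude4f')
print(ascii(fixed))  # '\U0001f64f'
print(len(fixed))    # 1
```

Text that contains no surrogates passes through unchanged. A lone (unpaired) surrogate makes the decode step raise UnicodeDecodeError, so handle that case separately if your data may contain one.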

Python 2 was more permissive.

Note: even if your JSON file contains the literal \ud83d\ude4f (12 characters), you shouldn't get the surrogate pair:

>>> import json
>>> print(ascii(json.loads(r'"\ud83d\ude4f"')))
'\U0001f64f'

Notice that the result is 1 character ('\U0001f64f'), not the surrogate pair ('\ud83d\ude4f').

Python: Find equivalent surrogate pair from non-BMP unicode char

You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:

import re

_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
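The pair can also be computed arithmetically from the code point, without going through an encoder. The following is a sketch of the standard UTF-16 formula; the function name surrogate_pair is hypothetical and not from the answer above.

```python
def surrogate_pair(char):
    """Return the (high, low) UTF-16 surrogate code units for a non-BMP character."""
    cp = ord(char)
    if cp < 0x10000:
        raise ValueError('character is in the BMP and has no surrogate pair')
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = surrogate_pair('\U0001f64f')
print(hex(high), hex(low))  # 0xd83d 0xde4f
```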

PYTHON RE Dont split UNICODE Chars into surrogate pairs while matching

Use a regular expression that matches either a surrogate pair or any single character. This works in both wide and narrow builds of Python 2, but isn't needed in a wide build, since wide builds don't use surrogate pairs.

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print re.findall(ur'[\ud800-\udbff][\udc00-\udfff]|.', te, re.UNICODE)
[u'A', u'\u5200', u'\U0001f600', u'\U0001f601', u'\u5100', u'Z']

This will still work in the latest Python 3, but also isn't needed because surrogate pairs are no longer used in Unicode strings (no wide or narrow build anymore):

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> te = u'A\u5200\U0001f600\U0001f601\u5100Z'
>>> print(re.findall(r'[\ud800-\udbff][\udc00-\udfff]|.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

Works without the surrogate match:

>>> print(re.findall(r'.', te))
['A', '刀', '😀', '😁', '儀', 'Z']

And then you can just iterate normally in Python 3:

>>> for c in te:
...     print(c)
...
A
刀
😀
😁
儀
Z

Note that there is still an issue with graphemes (combinations of Unicode code points that represent a single character). Here's a bad case:

>>> s = '👨🏻‍👩🏻‍👧🏻‍👦🏻'
>>> for c in s:
...     print(c)
...
👨
🏻
‍
👩
🏻
‍
👧
🏻
‍
👦
🏻

The third-party regex module can match graphemes:

>>> import regex
>>> s = '👨🏻‍👩🏻‍👧🏻‍👦🏻'
>>> for c in regex.findall(r'\X', s):
...     print(c)
...
👨🏻‍👩🏻‍👧🏻‍👦🏻

Python: getting correct string length when it contains surrogate pairs

I think this has been fixed in Python 3.3. See:

http://docs.python.org/py3k/whatsnew/3.3.html

http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
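As a rough illustration, assuming Python 3.3 or later, len() counts code points, while the UTF-16 code unit count can still be derived when needed. The names s and utf16_units are illustrative only.

```python
s = 'A\U0001f600Z'

# Python 3.3+ counts code points, so the emoji is one character.
print(len(s))  # 3

# UTF-16 code units, as counted by narrow Python 2 builds or JavaScript:
utf16_units = len(s.encode('utf-16-le')) // 2
print(utf16_units)  # 4
```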

c++: how to remove surrogate unicode values from string?

First off, you can't store UTF-16 surrogates in a std::string (char-based); you need std::u16string (char16_t-based) or, on Windows only, std::wstring (wchar_t-based). JavaScript strings are UTF-16 strings.

For those string types, you can use either:

  • std::remove_if() + std::basic_string::erase():

    #include <string>
    #include <algorithm>

    std::u16string str; // or std::wstring on Windows
    ...
    str.erase(
        std::remove_if(str.begin(), str.end(),
            [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
        ),
        str.end()
    );
  • std::erase_if() (C++20 and later only):

    #include <string>

    std::u16string str; // or std::wstring on Windows
    ...
    std::erase_if(str,
        [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
    );

UPDATE: You edited your question to change its semantics. Originally you asked how to remove surrogates; now you are asking how to replace them instead. You can use std::replace_if() for that task, e.g.:

#include <string>
#include <algorithm>

std::u16string str; // or std::wstring on Windows
...
std::replace_if(str.begin(), str.end(),
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
    u'_'
);

Or, if you really want a regex-based approach, you can use std::regex_replace(), e.g.:

#include <string>
#include <regex>

std::wstring str; // std::basic_regex does not support char16_t strings!
...
std::wstring newstr = std::regex_replace(
    str,
    std::wregex(L"[\\uD800-\\uDFFF]"),
    L"_"
);

