Python IMAP: =?utf-8?Q? in subject string
In MIME terminology, those encoded chunks are called encoded-words. You can decode them like this:
import email.header
text, encoding = email.header.decode_header('=?utf-8?Q?Subject?=')[0]
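In Python 3, decode_header returns raw (bytes, charset) pairs, so getting a readable subject usually takes one more step. A minimal sketch, assuming Python 3; decode_mime_header is a hypothetical helper name, not something from the standard library:

```python
from email.header import decode_header, make_header


def decode_mime_header(raw_value):
    """Return the header as a str, decoding any RFC 2047 encoded-words.

    decode_mime_header is a hypothetical helper name, not part of the
    standard library.
    """
    # decode_header yields (bytes_or_str, charset) pairs; make_header
    # stitches them into a Header object whose str() is the decoded text.
    return str(make_header(decode_header(raw_value)))


print(decode_mime_header('=?utf-8?Q?Subject?='))
print(decode_mime_header('=?utf-8?q?Abschlags=C3=A4nderung?='))
```

Values without any encoded-word pass through unchanged, so the helper can be applied to every header.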
Check out the docs for email.header for more details.
decode utf8 mail header
According to RFC 2047,
An 'encoded-word' MUST NOT appear within a 'quoted-string'.
A 'quoted-string' according to RFC 822 is
quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or quoted chars.
So I think the Python library is right, as
"=?utf-8?q?Abschlags=C3=A4nderung?="
is a quoted string. A better alternative with minimal quoting would be
=?utf-8?q?=22Abschlags=C3=A4nderung=22?=
having the " encoded as =22.
You could parse them by replacing the " with =?utf-8?q?=22?=:
>>> email.Header.decode_header('=?utf-8?q?=22?= =?utf-8?q?Abschlags=C3=A4nderung?= =?utf-8?q?=22?=')
[('"Abschlags\xc3\xa4nderung"', 'utf-8')]
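The session above is Python 2 (email.Header is the old module name). A rough Python 3 sketch of the same replacement idea follows; requote_encoded_word is a hypothetical helper, and it assumes an ASCII-compatible charset such as UTF-8 or ISO-8859-1, where =22 really is the double quote:

```python
import re
from email.header import decode_header, make_header

# Matches a double-quoted encoded-word and captures the word and its charset.
QUOTED_ENCODED_WORD = re.compile(r'"(=\?([^?]+)\?[qQbB]\?[^?]*\?=)"')


def requote_encoded_word(header):
    """Move the surrounding quotes into encoded-words of their own (=22).

    Hypothetical helper; assumes an ASCII-compatible charset such as
    UTF-8 or ISO-8859-1, where the byte 0x22 is the double quote.
    """
    return QUOTED_ENCODED_WORD.sub(r'=?\2?q?=22?= \1 =?\2?q?=22?=', header)


fixed = requote_encoded_word('"=?utf-8?q?Abschlags=C3=A4nderung?="')
print(fixed)
print(str(make_header(decode_header(fixed))))
```

Reusing the captured charset for the two quote words keeps all three words in the same charset, so the decoder can merge them into one string.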
How to get email.Header.decode_header to work with non-ASCII characters?
The header has to be encoded correctly in order to be decoded. It looks like val
comes from an already existing message, so maybe that message is bad. The error indicates it is a Unicode string, but it should be a byte string at that point. The examples in the Python documentation for email.header are straightforward.
The Python 2 session below encodes two header parts that don't even use the same encoding:
>>> import email.header
>>> h = email.header.Header(u'To: Märk'.encode('iso-8859-1'),'iso-8859-1')
>>> h.append(u'From: Jòhñ'.encode('utf8'),'utf8')
>>> h
<email.header.Header instance at 0x00559F58>
>>> s = h.encode()
>>> s
'=?iso-8859-1?q?To=3A_M=E4rk?= =?utf-8?b?RnJvbTogSsOyaMOx?='
Note that the correctly encoded header is a byte string with the encoding names embedded, and it uses no non-ASCII characters.
This decodes them:
>>> email.header.decode_header(s)
[('To: M\xe4rk', 'iso-8859-1'), ('From: J\xc3\xb2h\xc3\xb1', 'utf-8')]
>>> d = email.header.decode_header(s)
>>> for s,e in d:
... print s.decode(e)
...
To: Märk
From: Jòhñ
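The session above relies on Python 2 byte strings and the print statement. A rough Python 3 equivalent, as a sketch: the parts are passed as str, and the decoded chunks come back as bytes that still need decoding with the reported charset.

```python
from email.header import Header, decode_header

# In Python 3 the parts are passed as str; each charset is used only for
# the encoded-word that part ends up in.
h = Header('To: Märk', 'iso-8859-1')
h.append('From: Jòhñ', 'utf-8')
encoded = h.encode()          # a pure-ASCII str of encoded-words
print(encoded)

# decode_header returns bytes for encoded parts, so decode each one with
# the charset reported alongside it.
decoded = [chunk.decode(charset) for chunk, charset in decode_header(encoded)]
print(decoded)
```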
How to read emails with utf8 characters using imaplib
Got it working like this:
import imaplib
import email
import email.policy

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('email@gmail.com', 'pass')
mail.select('inbox')
res, mailList = mail.search(None, "FROM", "noreply@test.com")
for x in reversed(mailList[0].split()):
    typ, dataMail = mail.fetch(str(int(x)), '(RFC822)')
    msg = email.message_from_bytes(dataMail[0][1], policy=email.policy.default)
    # get the body of the email
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # fall back to UTF-8 when the part declares no charset
            charset = part.get_content_charset() or "utf-8"
            partStr = part.get_payload(decode=True)
            body += partStr.decode(charset, errors="replace")
    print(body)
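The body extraction can be tried without an IMAP server. The sketch below builds a made-up UTF-8 message in memory and parses it back from bytes; extract_plain_text is a hypothetical helper name, and it uses get_content(), which is available on parts parsed with email.policy.default:

```python
import email
import email.policy
from email.message import EmailMessage


def extract_plain_text(raw_bytes):
    """Collect the text/plain parts of a raw RFC 822 message as one str."""
    msg = email.message_from_bytes(raw_bytes, policy=email.policy.default)
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # With policy.default, get_content() decodes the transfer
            # encoding and the declared charset in one step.
            chunks.append(part.get_content())
    return "".join(chunks)


# Build a small UTF-8 test message in memory instead of fetching one.
sample = EmailMessage()
sample["Subject"] = "Abschlagsänderung"
sample["From"] = "noreply@test.com"
sample.set_content("Grüße aus Köln")
sample.add_alternative("<p>Grüße aus Köln</p>", subtype="html")
raw = sample.as_bytes()

print(extract_plain_text(raw))
```

In the IMAP loop, the same function could be called on dataMail[0][1].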
Incorrect decoding in Python3 email module
This appears to be a problem with how the email.parser infrastructure handles unfolding of multi-line headers containing encoded-word tokens for the From header and other structured headers. It does this correctly for unstructured headers such as Subject.
Your header has two encoded-word parts, on two separate lines. This is perfectly normal: an encoded-word token has a maximum length, so your UTF-8 data was split into two such words, with a line separator plus a space in between. While whatever generated the email was wrong to split in the middle of a UTF-8 character (RFC 2047 states that is strictly forbidden), a decoder of such data should still not insert spaces between the decoded bytes. It is the extra space that then prevents the email header handling from joining the surrogates and repairing the data.
So this appears to be a bug in the way the headers are parsed when handling structured headers; the parser does not correctly handle spaces between encoded words, here the space was introduced by the folded header line. This then results in the space being preserved in between the two encoded-word parts, preventing proper decoding. So while RFC2047 does state that encoded-word sections MUST contain whole characters (multi-byte encodings must not be split), it also states that encoded words can be split up with CRLF SPACE delimiters and any spaces in between encoded words are to be ignored.
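For comparison, the older email.header.decode_header function drops the whitespace between adjacent encoded-words and, as far as I can tell from the CPython 3 implementation, joins the raw bytes of same-charset words before any charset decoding happens. A small sketch with a made-up header that splits the two UTF-8 bytes of ä (C3 A4) across a folded line:

```python
from email.header import decode_header, make_header

# 'ä' is the two bytes C3 A4 in UTF-8; this (technically invalid) header
# splits them across two encoded-words on a folded line.
folded = '=?utf-8?q?Abschlags=C3?=\r\n =?utf-8?q?=A4nderung?='

parts = decode_header(folded)
print(parts)

subject = str(make_header(parts))
print(subject)
```

Because the bytes are concatenated first, the split character is repaired; the extra space described above never enters the picture on this code path.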
You can work around this by supplying a custom policy class, which removes the leading white space from lines in your own implementation of the Policy.header_fetch_parse() method.
import re
from email.policy import EmailPolicy

class UnfoldingEncodedStringHeaderPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading white space from header lines
        # that separates apparent encoded-word tokens before further processing
        # using somewhat crude CRLF-FWS-between-encoded-word matching
        value = re.sub(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)', '', value)
        return super().header_fetch_parse(name, value)
and use that as your policy when loading:
from sys import argv
from email.parser import Parser

custom_policy = UnfoldingEncodedStringHeaderPolicy()
with open(argv[1], 'r', encoding='utf-8') as eml_file:
    msg = Parser(policy=custom_policy).parse(eml_file)
Demo:
>>> from io import StringIO
>>> from email.parser import Parser
>>> from email.policy import default as default_policy
>>> custom_policy = UnfoldingEncodedStringHeaderPolicy()
>>> Parser(policy=default_policy).parse(StringIO(data))['from']
'彭以国/第二事业部项目部/第二事业� �� <addr@addr.com>'
>>> Parser(policy=custom_policy).parse(StringIO(data))['from']
'彭以国/第二事业部项目部/第二事业部 <addr@addr.com>'
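To see what the regular expression in the policy above does on its own, here is a small sketch applied to a made-up folded value (the header text is an assumption for illustration, not the data from the question):

```python
import re

# Same pattern as in the policy above: a line break plus leading
# whitespace that sits directly between two encoded-words.
FOLD_BETWEEN_WORDS = re.compile(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)')

# Made-up header value, folded between two encoded-words.
value = '=?utf-8?q?Abschlags=C3?=\r\n =?utf-8?q?=A4nderung?= <addr@addr.com>'
unfolded = FOLD_BETWEEN_WORDS.sub('', value)
print(repr(unfolded))

# A fold that is not surrounded by encoded-words is left alone.
plain = 'Hello\r\n world'
print(repr(FOLD_BETWEEN_WORDS.sub('', plain)))
```

Only folds that sit between ?= and =? are removed, which is why the author calls the matching somewhat crude: it never checks that the surrounding tokens are well-formed encoded-words.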
I filed Python issue #35547 to track this.
MIME email Subject etc. headers vs. utf8: first split, then encode?
As per RFC 2047 § 8's examples (and the overall explanation) an encoded-word does not magically span over several instances:
- =?UTF-8?Q?a?= neither continues a previous encoded-word, nor can it be continued by a following encoded-word. It is what it is: a.
- It is more obvious when we mix text encodings: =?UTF-8?Q?a?= =?ISO-8859-1?Q?b?= should render as ab, and it is clear that cutting UTF-8 in between would only halfway work when the next encoded-word is UTF-8 again (while a different text encoding surely uses different bytes).
I can only guess this is "too complicated" for a few programmers and they just think "it won't break anything anyway - so far nobody complained". But if an encoded-word must be cut, the proper way is to first decode it so that the text can be cut character-wise (instead of byte-wise), and then to encode both parts again. One caveat is: who does so must also support said text encoding - while UTF-8 is widespread today, would a software also know where to cut Shift-JIS and Big5 and UTF-16BE?
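The "first split, then encode" order can be sketched as follows. encode_words is a hypothetical helper, and it assumes the 75-character limit RFC 2047 places on a single encoded-word and the B (base64) encoding. It splits by Unicode code point, so it never cuts inside a multi-byte character, whatever charset Python is asked to encode to:

```python
import base64


def encode_words(text, charset="utf-8", max_len=75):
    """Split text character-wise, then encode each piece on its own.

    Hypothetical helper: returns a list of base64 ("B") encoded-words,
    each at most max_len characters long and each holding only whole
    characters of the given charset.
    """
    def to_word(piece):
        payload = base64.b64encode(piece.encode(charset)).decode("ascii")
        return "=?{}?b?{}?=".format(charset, payload)

    words, current = [], ""
    for ch in text:
        # Grow the current piece one character at a time and start a new
        # encoded-word as soon as the encoded form would get too long.
        if current and len(to_word(current + ch)) > max_len:
            words.append(to_word(current))
            current = ch
        else:
            current += ch
    if current:
        words.append(to_word(current))
    return words


subject = "彭以国/第二事业部项目部/第二事业部" * 3
for word in encode_words(subject):
    print(word)
```

Each resulting word decodes on its own, which is the property the answer asks for. The sketch leans on Python knowing the charset, which is exactly the caveat raised above for encodings such as Shift-JIS or Big5.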