Python - Email Header Decoding Utf-8

Python IMAP: =?utf-8?Q? in subject string

In MIME terminology, those encoded chunks are called encoded-words. You can decode them like this:

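import email.header
# decode_header returns a list of (decoded, charset) pairs; take the first one.
# On Python 3 the decoded part of an encoded-word is bytes: (b'Subject', 'utf-8')
text, encoding = email.header.decode_header('=?utf-8?Q?Subject?=')[0]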

Check out the docs for email.header for more details.
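
For a complete value rather than only the first pair, a minimal Python 3 sketch could combine decode_header with make_header from the same module. The variable names and the sample subject are only illustrative:

```python
from email.header import decode_header, make_header

raw_subject = '=?utf-8?Q?Abschlags=C3=A4nderung?='

# decode_header yields (bytes, charset) pairs for encoded-words
parts = decode_header(raw_subject)

# make_header reassembles the pairs; str() gives the decoded text
subject = str(make_header(parts))
print(subject)
```

This avoids decoding each chunk by hand, since make_header applies the charset that came with each pair.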

decode utf8 mail header

According to RFC 2047,

An 'encoded-word' MUST NOT appear within a 'quoted-string'.

A 'quoted-string' according to RFC 822 is

quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or quoted chars.

So I think the Python library is right, as

"=?utf-8?q?Abschlags=C3=A4nderung?="

is a quoted string. A better alternative with minimal quoting would be

=?utf-8?q?=22Abschlags=C3=A4nderung=22?=

having the " encoded as =22.

You could parse them by replacing the " with =?utf-8?q?=22?=:

>>> email.Header.decode_header('=?utf-8?q?=22?= =?utf-8?q?Abschlags=C3=A4nderung?= =?utf-8?q?=22?=')
[('"Abschlags\xc3\xa4nderung"', 'utf-8')]
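
The transcript above is from Python 2 (note the email.Header spelling and the byte-string result). A rough Python 3 sketch of the same replacement idea might look like this. decode_with_quotes and QUOTE_WORD are hypothetical names, and the blanket replace assumes every " in the value should end up in the decoded text, which will not hold for all headers:

```python
from email.header import decode_header, make_header

QUOTE_WORD = '=?utf-8?q?=22?='  # an encoded-word holding a single "

def decode_with_quotes(value):
    # Hypothetical helper: swap each literal " for its encoded-word form,
    # padded with spaces so neighbouring encoded-words stay separated.
    rewritten = value.replace('"', f' {QUOTE_WORD} ').strip()
    return str(make_header(decode_header(rewritten)))

decoded = decode_with_quotes('"=?utf-8?q?Abschlags=C3=A4nderung?="')
print(decoded)
```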

How to get email.Header.decode_header to work with non-ASCII characters?

The header has to be encoded correctly in order to be decoded. It looks like val comes from an already existing message, so maybe that message is bad. The error indicates it is a Unicode string, but it should be a byte string at that point. The examples in the Python help for email.header are straightforward.

Below encodes two headers that don't even use the same encoding:

>>> import email.header
>>> h = email.header.Header(u'To: Märk'.encode('iso-8859-1'),'iso-8859-1')
>>> h.append(u'From: Jòhñ'.encode('utf8'),'utf8')
>>> h
<email.header.Header instance at 0x00559F58>
>>> s = h.encode()
>>> s
'=?iso-8859-1?q?To=3A_M=E4rk?= =?utf-8?b?RnJvbTogSsOyaMOx?='

Note that the correctly encoded header is a byte string with the encoding names embedded, and it uses no non-ASCII characters.

This decodes them:

>>> email.header.decode_header(s)
[('To: M\xe4rk', 'iso-8859-1'), ('From: J\xc3\xb2h\xc3\xb1', 'utf-8')]
>>> d = email.header.decode_header(s)
>>> for s,e in d:
...     print s.decode(e)
...
To: Märk
From: Jòhñ
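
The session above is Python 2. On Python 3 the same round trip could be sketched as follows, assuming the header text is available as str so that Header can do the charset encoding itself:

```python
from email.header import Header, decode_header

# Python 3: pass str values and let Header do the charset encoding
h = Header('To: Märk', 'iso-8859-1')
h.append('From: Jòhñ', 'utf-8')
encoded = h.encode()  # ASCII-only str containing two encoded-words

# decode_header returns bytes plus the charset name for each encoded-word
decoded = [chunk.decode(charset) for chunk, charset in decode_header(encoded)]
print(encoded)
print(decoded)
```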

How to read emails with utf8 characters using imaplib

Got it working like this:

import imaplib
import email
import email.policy

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('email@gmail.com', 'pass')
mail.select('inbox')
res, mailList = mail.search(None, "FROM", "noreply@test.com")
for x in reversed(mailList[0].split()):
    typ, dataMail = mail.fetch(str(int(x)), '(RFC822)')
    msg = email.message_from_bytes(dataMail[0][1], policy=email.policy.default)

    # get the body of the email
    body = ""
    for part in msg.walk():
        charset = part.get_content_charset()
        if part.get_content_type() == "text/plain":
            partStr = part.get_payload(decode=True)
            # fall back to UTF-8 when the part declares no charset
            body += partStr.decode(charset or 'utf-8', errors='replace')

    print(body)
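
To try the decoding part without an IMAP server, here is a small offline sketch. The raw bytes are a made-up stand-in for what mail.fetch(..., '(RFC822)') returns. It relies on email.policy.default, under which header access gives decoded text and get_content() applies the declared charset:

```python
import email
import email.policy

# A tiny stand-in for the bytes that mail.fetch(..., '(RFC822)') would return
raw = (
    b"From: noreply@test.com\r\n"
    b"Subject: =?utf-8?q?Abschlags=C3=A4nderung?=\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=C3=BC=C3=9Fe\r\n"
)

msg = email.message_from_bytes(raw, policy=email.policy.default)

# With policy.default the encoded-word in Subject is decoded on access
subject = str(msg['subject'])

# get_body picks the preferred text part; get_content decodes it to str
part = msg.get_body(preferencelist=('plain',))
text = part.get_content() if part is not None else ''
print(subject, text.strip())
```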

Incorrect decoding in Python3 email module

This appears to be a problem with how the email.parser infrastructure is handling unfolding of multi-line headers containing encoded-word tokens for the From header and other structured headers. It does this correctly for unstructured headers such as Subject.

Your header has two encoded-word parts on two separate lines. That is perfectly normal: an encoded-word token has a maximum length, so your UTF-8 data was split into two such words, with a line separator plus a space in between. Whatever generated the email was wrong to split in the middle of a UTF-8 character (RFC 2047 strictly forbids that), but a decoder of such data should still not insert spaces between the decoded bytes. It is the extra space that prevents the email header handling from joining the surrogates and repairing the data.

So this appears to be a bug in the way the headers are parsed when handling structured headers; the parser does not correctly handle spaces between encoded words, here the space was introduced by the folded header line. This then results in the space being preserved in between the two encoded-word parts, preventing proper decoding. So while RFC2047 does state that encoded-word sections MUST contain whole characters (multi-byte encodings must not be split), it also states that encoded words can be split up with CRLF SPACE delimiters and any spaces in between encoded words are to be ignored.
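
For comparison, the legacy email.header.decode_header function appears to follow that rule: it drops the white space between adjacent encoded-words and joins their bytes before any charset decoding. A minimal sketch with a made-up value, where the two UTF-8 bytes of "é" are deliberately split across a folded line:

```python
from email.header import decode_header, make_header

# "é" is the two bytes C3 A9 in UTF-8; here they sit in separate encoded-words
folded = '=?utf-8?q?caf=C3?=\r\n =?utf-8?q?=A9?='

# The fold between the two words is ignored and the bytes are concatenated
parts = decode_header(folded)
text = str(make_header(parts))
print(parts, text)
```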

You can work around this by supplying a custom policy class, which removes the leading white space from lines in your own implementation of the Policy.header_fetch_parse() method.

import re
from email.policy import EmailPolicy

class UnfoldingEncodedStringHeaderPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading white space from header lines
        # that separates apparent encoded-word tokens before further processing
        # using somewhat crude CRLF-FWS-between-encoded-word matching
        value = re.sub(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)', '', value)
        return super().header_fetch_parse(name, value)

and use that as your policy when loading:

from sys import argv
from email.parser import Parser

custom_policy = UnfoldingEncodedStringHeaderPolicy()

with open(argv[1], 'r', encoding='utf-8') as eml_file:
    msg = Parser(policy=custom_policy).parse(eml_file)

Demo:

>>> from io import StringIO
>>> from email.parser import Parser
>>> from email.policy import default as default_policy
>>> custom_policy = UnfoldingEncodedStringHeaderPolicy()
>>> Parser(policy=default_policy).parse(StringIO(data))['from']
'彭以国/第二事业部项目部/第二事业� �� <addr@addr.com>'
>>> Parser(policy=custom_policy).parse(StringIO(data))['from']
'彭以国/第二事业部项目部/第二事业部 <addr@addr.com>'

I filed Python issue #35547 to track this.
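
To see what the regular expression in the policy does in isolation, here is a small sketch. The header value is made up, and FOLD_BETWEEN_WORDS is just an illustrative name for the same pattern:

```python
import re

# Same pattern as in the policy: folding white space between two encoded-words
FOLD_BETWEEN_WORDS = re.compile(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)')

# Hypothetical folded From value with a UTF-8 character split across the words
value = '=?utf-8?q?caf=C3?=\r\n =?utf-8?q?=A9?= <addr@addr.com>'

# Only the fold that sits directly between "?=" and "=?" is removed
unfolded = FOLD_BETWEEN_WORDS.sub('', value)
print(unfolded)
```

The two encoded-words end up directly adjacent, so no space can be inserted between their decoded bytes. Folds that are not between two encoded-words are left alone.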

MIME email Subject etc. headers vs. utf8: first split, then encode?

As per RFC 2047 § 8's examples (and the overall explanation) an encoded-word does not magically span over several instances:

  • =?UTF-8?Q?a?= neither continues a previous encoded-word nor can it be continued by a following encoded-word. It is what it is: a.
  • This is more obvious when text encodings are mixed: =?UTF-8?Q?a?= =?ISO-8859-1?Q?b?= should render as ab. Cutting in the middle of a UTF-8 character would only halfway work when the next encoded-word is UTF-8 again, since a different text encoding surely uses different bytes.

As a logical consequence, UTF-8 should be split by characters, not bytes. That means the output of both encoding B (Base64) and encoding Q (quoted-printable) should not be cut, unless the cut coincidentally also falls between the encoded text's characters. The cutting must occur before encoding.

I can only guess this is "too complicated" for a few programmers and they just think "it won't break anything anyway - so far nobody complained". But if an existing encoded-word must be cut, the proper way is to decode it first so that the text can be cut character-wise (instead of byte-wise), and then to encode both parts again. One caveat: whoever does so must also support said text encoding. While UTF-8 is widespread today, would the software also know where to cut Shift-JIS, Big5, and UTF-16BE?
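
A minimal sketch of "first split, then encode" for encoding B. encode_words is a hypothetical helper, not a library function. It assumes a stateless charset such as UTF-8 and uses the 75-character limit that RFC 2047 places on an encoded-word:

```python
import base64
from email.header import decode_header

def encode_words(text, charset='utf-8', max_len=75):
    """Split text on character boundaries, then B-encode each chunk."""
    overhead = len(f'=?{charset}?b??=')  # everything except the payload
    chunks, chunk = [], ''
    for ch in text:
        candidate = chunk + ch
        # Base64 turns every 3 bytes (rounded up) into 4 characters
        encoded_len = 4 * ((len(candidate.encode(charset)) + 2) // 3)
        if chunk and overhead + encoded_len > max_len:
            chunks.append(chunk)  # close the chunk before this character
            chunk = ch
        else:
            chunk = candidate
    if chunk:
        chunks.append(chunk)
    return [
        f"=?{charset}?b?{base64.b64encode(c.encode(charset)).decode('ascii')}?="
        for c in chunks
    ]

sample = '彭以国/第二事业部项目部/第二事业部/' * 3
words = encode_words(sample)
print('\r\n '.join(words))
```

Because the cut happens on the str before encoding, every encoded-word decodes on its own, whichever charset the neighbouring words use.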


