Best Way to Handle Email Parsing/Decoding in PHP

Best way to handle email parsing/decoding in PHP?

Funny you should ask... Im actually working on a simple notification system now. I just finished up the Bounce Manager with i use Zend_Mail to implement. It has pretty much all the features you're looking for... you can connect to a mailbox (POP3, IMAP, Mbox, and Maildir) and pull messages from it as well as operate on all those messages.

It handles multipart messages, but the parts can be hard to work with. I had a hard time figuring out which part was the attached original message part in the NDR's I was working with, but I have a feeling I just missed something in the documentation. I'm not sure how it handles encoding, because my usage was fairly simple but I'm pretty sure it has provisions for all the encodings you mentioned. Check out the docs and browse the API.

parsing raw email in php

What are you hoping to end up with at the end? The body, the subject, the sender, an attachment? You should spend some time with RFC2822 to understand the format of the mail, but here's the simplest rules for well formed email:

HEADERS\n
\n
BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT

HSTRING always starts at the beginning of a line and doesn't contain any white space or colons. HTEXT can contain a wide variety of text, including newlines as long as the newline char is followed by whitespace.

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but processing it over a pipe you don't have to worry about that).

So, in really simple, circa-1982 RFC822 terms, an email looks like this:

HEADER: HEADER TEXT
HEADER: MORE HEADER TEXT
INCLUDING A LINE CONTINUATION
HEADER: LAST HEADER

THIS IS ANY
ARBITRARY DATA
(FOR THE MOST PART)

Most modern email is more complex than that though. Headers can be encoded for charsets or RFC2047 mime words, or a ton of other stuff I'm not thinking of right now. The bodies are really hard to roll your own code for these days to if you want them to be meaningful. Almost all email that's generated by an MUA will be MIME encoded. That might be uuencoded text, it might be html, it might be a uuencoded excel spreadsheet.

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.

How to decode subject and date which were retrieved from gmail (Japan text)

 public function encodeToUtf8($string) {
mb_internal_encoding("utf-8");
$str_japan = mb_convert_encoding($string, "UTF-8", "ISO-2022-JP");
//"=?iso-2022-jp?B?GyRCQjZCPjhsOEBCLDtuGyhCIC0gdGVzdGluZw==?=";
return mb_decode_mimeheader($str_japan);
}

usage :

encodeToUtf8($overview[0]->date);
encodeToUtf8($overview[0]->subject);

Parsing Email Body with 7BIT Content-Transfer-Encoding - PHP

After spending a bit more time, I decided to just write up some heuristic detection, as Max suggested in the comments on my original question.

I've built a more robust decode7Bit() method in Imap.php, which goes through a bunch of common encoded characters (like =A0) and replaces them with their UTF-8 equivalents, and then also decodes messages if they look like they are base64-encoded:

/**
* Decodes 7-Bit text.
*
* PHP seems to think that most emails are 7BIT-encoded, therefore this
* decoding method assumes that text passed through may actually be base64-
* encoded, quoted-printable encoded, or just plain text. Instead of passing
* the email directly through a particular decoding function, this method
* runs through a bunch of common encoding schemes to try to decode everything
* and simply end up with something *resembling* plain text.
*
* Results are not guaranteed, but it's pretty good at what it does.
*
* @param $text (string)
* 7-Bit text to convert.
*
* @return (string)
* Decoded text.
*/
public function decode7Bit($text) {
// If there are no spaces on the first line, assume that the body is
// actually base64-encoded, and decode it.
$lines = explode("\r\n", $text);
$first_line_words = explode(' ', $lines[0]);
if ($first_line_words[0] == $lines[0]) {
$text = base64_decode($text);
}

// Manually convert common encoded characters into their UTF-8 equivalents.
$characters = array(
'=20' => ' ', // space.
'=E2=80=99' => "'", // single quote.
'=0A' => "\r\n", // line break.
'=A0' => ' ', // non-breaking space.
'=C2=A0' => ' ', // non-breaking space.
"=\r\n" => '', // joined line.
'=E2=80=A6' => '…', // ellipsis.
'=E2=80=A2' => '•', // bullet.
);

// Loop through the encoded characters and replace any that are found.
foreach ($characters as $key => $value) {
$text = str_replace($key, $value, $text);
}

return $text;
}

This was taken from version 1.0-beta2 of the Imap class for PHP that I have on GitHub.

If you have any ideas for making this more efficient, let me know. I originally tried running everything through quoted_printable_decode(), but sometimes PHP would throw exceptions that were vague and unhelpful, so I gave up on that approach.

Proper way to decode incoming email subject (utf 8)

Despite the fact that this is almost a year old - I found this and am facing a similar problem.

I'm unsure why you're getting odd characters, but perhaps you are trying to display them somewhere your charset is unsupported.

Here's some code I wrote which should handle everything except the charset conversion, which is a large problem that many libraries handle much better. (PHP's MB library, for instance)

class mail {
/**
* If you change one of these, please check the other for fixes as well
*
* @const Pattern to match RFC 2047 charset encodings in mail headers
*/
const rfc2047header = '/=\?([^ ?]+)\?([BQbq])\?([^ ?]+)\?=/';

const rfc2047header_spaces = '/(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)\s+(=\?[^ ?]+\?[BQbq]\?[^ ?]+\?=)/';

/**
* http://www.rfc-archive.org/getrfc.php?rfc=2047
*
* =?<charset>?<encoding>?<data>?=
*
* @param string $header
*/
public static function is_encoded_header($header) {
// e.g. =?utf-8?q?Re=3a=20Support=3a=204D09EE9A=20=2d=20Re=3a=20Support=3a=204D078032=20=2d=20Wordpress=20Plugin?=
// e.g. =?utf-8?q?Wordpress=20Plugin?=
return preg_match(self::rfc2047header, $header) !== 0;
}

public static function header_charsets($header) {
$matches = null;
if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_PATTERN_ORDER)) {
return array();
}
return array_map('strtoupper', $matches[1]);
}

public static function decode_header($header) {
$matches = null;

/* Repair instances where two encodings are together and separated by a space (strip the spaces) */
$header = preg_replace(self::rfc2047header_spaces, "$1$2", $header);

/* Now see if any encodings exist and match them */
if (!preg_match_all(self::rfc2047header, $header, $matches, PREG_SET_ORDER)) {
return $header;
}
foreach ($matches as $header_match) {
list($match, $charset, $encoding, $data) = $header_match;
$encoding = strtoupper($encoding);
switch ($encoding) {
case 'B':
$data = base64_decode($data);
break;
case 'Q':
$data = quoted_printable_decode(str_replace("_", " ", $data));
break;
default:
throw new Exception("preg_match_all is busted: didn't find B or Q in encoding $header");
}
// This part needs to handle every charset
switch (strtoupper($charset)) {
case "UTF-8":
break;
default:
/* Here's where you should handle other character sets! */
throw new Exception("Unknown charset in header - time to write some code.");
}
$header = str_replace($match, $data, $header);
}
return $header;
}
}

When run through a script and displayed in a browser using UTF-8, the result is:

آزمایش

You would run it like so:

$decoded = mail::decode_header("=?UTF-8?B?2KLYstmF2KfbjNi0?=");

PHP email body decoding

You've likely selected the wrong body part for those messages. Those are binary files: one is a GIF file (see the GIF89 header), and the other is a ZIP file (see the PK header).

PHP imap how to decode email body correctly?

I suggest you have a look at this: https://github.com/mantisbt-plugins/EmailReporting/blob/master/core/Mail/Parser.php
It uses the stuff you've used as well but it adds character encoding on top of it.

Subject character encoding can happen inline. For email bodies its one character set for the entire body

The script given uses a pear package, not the IMAP extension but based on your input it should be pretty equal

Hope this helps

PHP IMAP decoding messages

imap_bodystruct() or imap_fetchstructure() should return this info to you. The following code should do exactly what you're looking for:

<?php
$hostname = '{********:993/imap/ssl}INBOX';
$username = '*********';
$password = '******';

$inbox = imap_open($hostname,$username,$password) or die('Cannot connect to server: ' . imap_last_error());

$emails = imap_search($inbox,'ALL');

if($emails) {
$output = '';
rsort($emails);

foreach($emails as $email_number) {
$overview = imap_fetch_overview($inbox,$email_number,0);
$structure = imap_fetchstructure($inbox, $email_number);

if(isset($structure->parts) && is_array($structure->parts) && isset($structure->parts[1])) {
$part = $structure->parts[1];
$message = imap_fetchbody($inbox,$email_number,2);

if($part->encoding == 3) {
$message = imap_base64($message);
} else if($part->encoding == 1) {
$message = imap_8bit($message);
} else {
$message = imap_qprint($message);
}
}

$output.= '<div class="toggle'.($overview[0]->seen ? 'read' : 'unread').'">';
$output.= '<span class="from">From: '.utf8_decode(imap_utf8($overview[0]->from)).'</span>';
$output.= '<span class="date">on '.utf8_decode(imap_utf8($overview[0]->date)).'</span>';
$output.= '<br /><span class="subject">Subject('.$part->encoding.'): '.utf8_decode(imap_utf8($overview[0]->subject)).'</span> ';
$output.= '</div>';

$output.= '<div class="body">'.$message.'</div><hr />';
}

echo $output;
}

imap_close($inbox);
?>


Related Topics



Leave a reply



Submit