Regex to Extract Boundary and Content Type Out of Mail Headers

This regex will match the text:

/Type:\s*(.*?);\s*boundary="(.*?)"/

Note the two capture groups for the data you want.
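
For anyone working in Python rather than PHP, here is a minimal sketch of the same idea. The sample header and its boundary value are made up for illustration, and the function name extract_type_and_boundary is hypothetical. It anchors on the full "Content-Type:" name and allows any run of whitespace before "boundary=", which is an assumption about how the header is folded.

```python
import re

# Hypothetical sample header; the boundary value is invented for illustration.
SAMPLE_HEADER = (
    "MIME-Version: 1.0\r\n"
    "Content-Type: multipart/alternative;\r\n"
    '\tboundary="example-boundary-123"\r\n'
)

# Two capture groups, as in the pattern above: content type, then boundary.
# \s* before "boundary" tolerates a folded (line-wrapped) header.
HEADER_RE = re.compile(r'Content-Type:\s*(.*?);\s*boundary="(.*?)"', re.IGNORECASE)


def extract_type_and_boundary(header):
    """Return (content_type, boundary), or None if no quoted boundary is found."""
    match = HEADER_RE.search(header)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(extract_type_and_boundary(SAMPLE_HEADER))
```

Note that this only handles a quoted boundary that directly follows the content type; other parameters in between (such as charset) would need a looser pattern.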

regexp get content type boundary

Try this to capture only the boundary:

preg_match('/boundary="(.*?)"/i', $header, $match);

Or, to capture the content type as well:

preg_match('/Content-Type:(.*)\n*\s*boundary="([^\"]+)"/i', $header, $match);

Output (the captured boundary):

9B095B5ADSN=_01D16F24CC6015F600AB1926COL004?MC5F18.ho

extract body from raw email with regex

I think you'd be better off going down the email a line at a time, as it's the line breaks that are more critical in e-mail formation.

Your rules would be:

If you get a double line break, then the body is starting, and it is plain text (as there are no headers to indicate otherwise).
Otherwise, carry on until you get the "boundary=" bit, then record the boundary and hop into a "looking for boundary" mode.
Then, when you find a boundary, hop into a "looking for Content-Type or double new-line" mode: look for Content-Type (and note its value) or a double new-line (the headers have finished, and the body runs until the next boundary).
While reading the body of the message, you're back in "looking for boundary" mode to repeat the process.
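
The rules above can be sketched as a small line-by-line state machine. This is only an illustration in Python, not a full MIME parser: the mode names, the split_bodies function and the sample message are all made up, and it deliberately sticks with the first boundary it sees, so nested parts end up as raw text inside their parent part.

```python
import re

BOUNDARY_RE = re.compile(r'boundary="?([^";\s]+)"?', re.IGNORECASE)
CONTENT_TYPE_RE = re.compile(r'^Content-Type:\s*([^;\s]+)', re.IGNORECASE)


def split_bodies(raw):
    """Return a list of (content_type, body_text) tuples for a raw message."""
    mode = "top-headers"
    boundary = None
    content_type = "text/plain"
    body_lines = []
    bodies = []

    for line in raw.splitlines():
        stripped = line.rstrip()
        if mode == "top-headers":
            found = BOUNDARY_RE.search(line)
            if found:
                # Record the first boundary and stick with it.
                boundary = found.group(1)
                mode = "seek-boundary"
            elif stripped == "":
                # Blank line before any boundary: a plain-text body follows.
                mode = "plain-body"
        elif mode == "plain-body":
            body_lines.append(line)
        elif mode == "done":
            break
        elif stripped in ("--" + boundary, "--" + boundary + "--"):
            # A delimiter closes the part being read (if any).
            if mode == "part-body":
                bodies.append((content_type, "\n".join(body_lines)))
            body_lines = []
            content_type = "text/plain"
            mode = "done" if stripped.endswith(boundary + "--") else "part-headers"
        elif mode == "part-headers":
            if stripped == "":
                mode = "part-body"  # headers finished, body comes next
            else:
                found = CONTENT_TYPE_RE.match(line)
                if found:
                    content_type = found.group(1)
        elif mode == "part-body":
            body_lines.append(line)
        # In "seek-boundary" mode every other line is skipped.

    if mode == "plain-body":
        bodies.append(("text/plain", "\n".join(body_lines)))
    return bodies


# Hypothetical two-part message used only to exercise the sketch.
SAMPLE = "\n".join([
    "From: a@example.com",
    'Content-Type: multipart/alternative; boundary="XYZ"',
    "",
    "preamble ignored",
    "--XYZ",
    "Content-Type: text/plain; charset=utf-8",
    "",
    "Hello",
    "--XYZ",
    "Content-Type: text/html",
    "",
    "<p>Hello</p>",
    "--XYZ--",
])

for ctype, body in split_bodies(SAMPLE):
    print(ctype, repr(body))
```

A part with no Content-Type header is reported as text/plain here, which is a simplifying assumption; a message with no boundary at all comes back as a single plain-text body.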

Something I remember from a long time ago, so the following may not be 100% accurate, but I'll mention it just in case. Be careful with messages that have attachments, as you can get two "boundary" markers. One boundary sits within the other, though, so if you follow the rules above (i.e. grab the first boundary and stick with it) then you should be fine. But test your script with some attachments :)

Edit: additional info as asked in the question. An e-mail can have as many "bodies" as the user wishes to encode. You can have a plain version, an HTML version, a UTF encoded version, an RTF version or even a Morse Code version (if the client knew how to handle "Content-Type Morse/Code"!). Sometimes you don't get plain text, but only HTML versions (naughty users). Sometimes the HTML actually comes without the content type declaration (which may or may not get displayed as HTML, depending on the client). The boundary also splits off the attachments. Rich text is a gotcha from Outlook (although, to be fair, it usually IS converted to HTML). So no, there's somewhere between 0 and X bodies.

Using regex to separate a multipart email

There are a few open source C# MIME parsers available:

http://mimeparser.codeplex.com/
http://anmar.eu.org/projects/sharpmimetools/
http://www.codeproject.com/KB/cs/MIME_De_Encode_in_C_.aspx

The last two are a bit old. If they don't easily compile, their source might point you in the right direction.

Remember, an email can contain an attachment that is an email that contains an attachment, etc, etc... At some point, Regex will let you down.
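
To illustrate the nesting point (in Python rather than C#, and not using any of the libraries linked above), here is a small sketch with the standard library's email package, which parses nested messages recursively. The sample message, its boundaries and the list_part_types helper are all made up for the example.

```python
from email import message_from_string

# Hypothetical message: an email whose attachment is another email
# that itself carries an attachment.
RAW = "\n".join([
    "From: a@example.com",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="OUTER"',
    "",
    "--OUTER",
    "Content-Type: text/plain",
    "",
    "See the forwarded mail.",
    "--OUTER",
    "Content-Type: message/rfc822",
    "",
    "From: b@example.com",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="INNER"',
    "",
    "--INNER",
    "Content-Type: text/plain",
    "",
    "Inner text.",
    "--INNER",
    "Content-Type: application/octet-stream",
    "Content-Transfer-Encoding: base64",
    "",
    "aGVsbG8=",
    "--INNER--",
    "--OUTER--",
])


def list_part_types(raw):
    """Return the content type of every part, walking nested messages too."""
    message = message_from_string(raw)
    return [part.get_content_type() for part in message.walk()]


for content_type in list_part_types(RAW):
    print(content_type)
```

walk() visits the outer container, its parts, and then everything inside the attached message, which is exactly the kind of recursion a single regex struggles with.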

PHP Regular Expressions (REGEX) Multipart MIME (NOT-EMAIL)

Alternatively, you could parse with explode(). This should be much faster, it's not too complex, and it gives you the header info if you want it:

<?php

$body = file_get_contents('output.txt');
$boundary = '__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__';
$parts = explode("--$boundary", $body);
array_shift($parts); # delete up to the first boundary
array_pop($parts); # delete after the last boundary

$binaries = array();
foreach($parts as $part) {
    list($header, $binary) = explode("\n\n", $part, 2);
    $binaries[] = $binary;
}    

print_r($binaries);
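
For comparison, roughly the same split-on-boundary approach could look like this in Python. It is a sketch under the same assumptions as the PHP above: the boundary is already known, parts use "\n\n" between headers and content, and the boundary string does not occur inside any part. The split_multipart name and the sample body are made up.

```python
def split_multipart(body, boundary):
    """Split a multipart body into (header_text, content) pairs."""
    chunks = body.split("--" + boundary)
    # Drop everything before the first boundary and after the last one.
    chunks = chunks[1:-1]
    parts = []
    for chunk in chunks:
        # Headers and content are separated by the first blank line.
        header, _, content = chunk.partition("\n\n")
        parts.append((header.strip(), content))
    return parts


# Hypothetical sample body with a short boundary for readability.
SAMPLE_BODY = (
    "preamble\n"
    "--B\n"
    "Content-Type: text/plain\n"
    "\n"
    "first\n"
    "--B\n"
    "Content-Type: text/html\n"
    "\n"
    "<b>second</b>\n"
    "--B--\n"
)

for header, content in split_multipart(SAMPLE_BODY, "B"):
    print(repr(header), repr(content))
```

Messages with CRLF line endings would need "\r\n\r\n" as the separator instead.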

Parsing email headers with regular expressions in python

Your demo text is practically the mbox format, which can be processed with the mbox class in the mailbox module:

from mailbox import mbox
import re

PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")

mymbox = mbox("demo.txt")
for message in mymbox.values():
    # A missing header comes back as None, so fall back to an empty string.
    from_address = PAT_EMAIL.findall(message["from"] or "")
    to_address = PAT_EMAIL.findall(message["to"] or "")
    date = [message["date"] or ""]
    print(";".join(from_address + to_address + date))

Regular expression for matching content between lines in ruby

OK, so the solution for this was pretty simple. I ended up with an expression like the following:

--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036.+?(On \\d{0,2}[\\/\\-]\\d{0,2}[\\/\\-]\\d{0,4}.+?)--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036

No need to perform a look-ahead/behind for this.
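
The same pattern can be sketched in Python, with the boundary passed in rather than hard-coded. The quoted_reply name, the shortened boundary and the sample text are made up for illustration; re.DOTALL is needed so that "." can cross line breaks.

```python
import re


def quoted_reply(text, boundary):
    """Return the text from the 'On <date>' line up to the next boundary, or None."""
    escaped = re.escape(boundary)
    pattern = re.compile(
        escaped + r".+?(On \d{0,2}[/-]\d{0,2}[/-]\d{0,4}.+?)" + escaped,
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical sample with a shortened boundary.
BOUNDARY = "--Apple-Mail=_EXAMPLE"
SAMPLE_TEXT = (
    BOUNDARY + "\n"
    "Content-Type: text/plain\n"
    "\n"
    "Thanks!\n"
    "\n"
    "On 12/03/2014, at 10:00, Bob wrote:\n"
    "> Hi\n"
    + BOUNDARY + "\n"
    "Content-Type: text/html\n"
)

print(quoted_reply(SAMPLE_TEXT, BOUNDARY))
```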
