Regex to extract boundary and content type out of Mail headers
This regex will match the text:
/Type:\s{0,}(.*?);\sboundary=\"(.*?)\"/
Note the two capture groups: the first holds the content type and the second holds the boundary.
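As a rough illustration, here is the same pattern applied with Python's re module. The sample header and the variable names are made up for the example, and the pattern assumes the boundary is quoted and sits on the same line as the type.

```python
import re

# Same pattern as above, with \s{0,} written as the equivalent \s*.
PATTERN = re.compile(r'Type:\s*(.*?);\sboundary="(.*?)"')

# Hypothetical header line, invented for this example.
header = 'Content-Type: multipart/alternative; boundary="abc123"'

match = PATTERN.search(header)
if match:
    content_type, boundary = match.groups()
    print(content_type)  # multipart/alternative
    print(boundary)      # abc123
```

If the header is folded across two lines, or the boundary is unquoted, this pattern will not match, so check for a missing match before using the groups.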
regexp get content type boundary
Try this
preg_match('/boundary="(.*?)"/i', $header, $match);
or
preg_match('/Content-Type:(.*)\n*\s*boundary="([^\"]+)"/i', $header, $match);
Output ($match[1] for the first pattern, $match[2] for the second)
9B095B5ADSN=_01D16F24CC6015F600AB1926COL004?MC5F18.ho
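Mail headers are often folded, so the boundary parameter may sit on an indented continuation line. Here is a small sketch of the second pattern handling that case, ported to Python purely for illustration. The header text is invented and the boundary is assumed to be quoted.

```python
import re

# Port of the second pattern; re.IGNORECASE mirrors the /i modifier.
PATTERN = re.compile(r'Content-Type:(.*)\n*\s*boundary="([^"]+)"', re.IGNORECASE)

# Hypothetical folded header: the parameter is on a continuation line.
header = 'content-type: multipart/mixed;\n\tboundary="frontier-42"\n'

match = PATTERN.search(header)
if match is None:
    raise ValueError("no quoted boundary found")

# Group 1 still carries the leading space and trailing semicolon.
content_type = match.group(1).strip().rstrip(";")
boundary = match.group(2)
print(content_type, boundary)
```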
extract body from raw email with regex
I think you'd be better off going through the email a line at a time, as it's the line breaks that are most critical in e-mail formatting.
Your rules would be:
- If you get a double line break, then the body is starting - plain text type (as there are no headers to indicate which).
- Otherwise, carry on until you get the "boundary=" bit, and then you record the boundary and hop into a "looking for boundary" mode.
- Then, when you find a boundary, hop into "looking for Content-Type or double new-line" mode, and look for Content-Type (and note the content type) or a double new-line (the header has finished, and the body comes next until the next boundary).
- While reading the body of the message, you're back in "looking for boundary" mode to repeat the process.
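The rules above can be sketched as a small line-by-line state machine. This is only one way to do it, written in Python. The function name split_bodies, the state names and the sample message are all invented here, and it assumes "\n" line endings and a boundary declared on a single header line.

```python
import re

BOUNDARY_RE = re.compile(r'boundary="?([^";]+)"?', re.IGNORECASE)
CTYPE_RE = re.compile(r'^Content-Type:\s*([^;\s]+)', re.IGNORECASE)


def split_bodies(raw):
    """Return a list of (content_type, body) pairs from a raw message."""
    state = "top-headers"
    boundary = None
    ctype = "text/plain"  # default when no header says otherwise
    body, parts = [], []

    for line in raw.splitlines():
        if state == "top-headers":
            found = BOUNDARY_RE.search(line)
            if found:
                boundary = found.group(1)    # keep the first boundary only
                state = "seek-boundary"
            elif line == "":
                state = "plain-body"         # blank line, no boundary seen
        elif state == "plain-body":
            body.append(line)
        elif line == "--" + boundary + "--": # closing delimiter
            if state == "part-body":
                parts.append((ctype, "\n".join(body)))
            state = "done"
            break
        elif line == "--" + boundary:        # a new part begins
            if state == "part-body":
                parts.append((ctype, "\n".join(body)))
            state, ctype, body = "part-headers", "text/plain", []
        elif state == "part-headers":
            found = CTYPE_RE.match(line)
            if found:
                ctype = found.group(1)
            elif line == "":
                state = "part-body"          # headers done, body follows
        elif state == "part-body":
            body.append(line)

    if state == "plain-body":
        parts.append((ctype, "\n".join(body)))
    return parts


sample = "\n".join([
    "From: a@example.com",
    'Content-Type: multipart/alternative; boundary="XYZ"',
    "",
    "--XYZ",
    "Content-Type: text/plain",
    "",
    "Hello",
    "--XYZ",
    "Content-Type: text/html",
    "",
    "<p>Hello</p>",
    "--XYZ--",
])
print(split_bodies(sample))
```

Because only the first boundary is tracked, a nested multipart section comes back as a single part with its inner delimiters still in the body.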
Something I remember from a long time ago - so the following may not be 100% accurate, but I'll mention it just in case. Be careful with files with attachments, as you can get two "boundary" markers. But one boundary is within another boundary, so if you follow the rules above (i.e. grab the first boundary and stick with it) then you should be fine. But test your script with some attachments :)
Edit: additional info as asked in the question. An e-mail can have as many "bodies" as the user wishes to encode. You can have a plain-text version, an HTML version, a UTF-encoded version, an RTF version or even a Morse code version (if the client knew how to handle "Content-Type: Morse/Code"!). Sometimes you don't get plain text, only an HTML version (naughty users). Sometimes the HTML actually comes without the content type declaration (which may or may not get displayed as HTML, depending on the client). The boundary also splits off the attachments. Rich text is a gotcha from Outlook (although, to be fair, it usually IS converted to HTML). So no, there's no fixed number: somewhere between 0 and X bodies.
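To see how many bodies a given message actually carries, one option is to let a MIME parser do the splitting. The sketch below uses Python's standard email package instead of hand-written rules. The sample message is invented for the example.

```python
import email

raw = "\n".join([
    "From: sender@example.com",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="XYZ"',
    "",
    "--XYZ",
    "Content-Type: text/plain",
    "",
    "Hello",
    "--XYZ",
    "Content-Type: text/html",
    "",
    "<p>Hello</p>",
    "--XYZ--",
    "",
])

message = email.message_from_string(raw)

# walk() visits the container and every nested part; keep only the leaves.
bodies = [
    (part.get_content_type(), part.get_payload())
    for part in message.walk()
    if not part.is_multipart()
]
for content_type, payload in bodies:
    print(content_type, repr(payload))
```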
Using regex to separate a multipart email
There are a few open source C# MIME parsers available:
- http://mimeparser.codeplex.com/
- http://anmar.eu.org/projects/sharpmimetools/
- http://www.codeproject.com/KB/cs/MIME_De_Encode_in_C_.aspx
The last two are a bit old. If they don't easily compile, their source might point you in the right direction.
Remember, an email can contain an attachment that is an email that contains an attachment, etc, etc... At some point, Regex will let you down.
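To make the nesting point concrete, here is a small sketch in Python (not C#) using the standard email package. The message text and the outline helper are invented for the example. It shows why a recursive walk is needed once a message/rfc822 attachment contains its own multipart structure.

```python
import email

raw = "\n".join([
    'Content-Type: multipart/mixed; boundary="OUTER"',
    "",
    "--OUTER",
    "Content-Type: text/plain",
    "",
    "See the forwarded mail.",
    "--OUTER",
    "Content-Type: message/rfc822",
    "",
    "Subject: inner",
    'Content-Type: multipart/mixed; boundary="INNER"',
    "",
    "--INNER",
    "Content-Type: text/plain",
    "",
    "Inner text.",
    "--INNER--",
    "--OUTER--",
    "",
])


def outline(part, depth=0):
    """Return one indented line per MIME part, recursing into containers."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        for child in part.get_payload():
            lines.extend(outline(child, depth + 1))
    return lines


tree = outline(email.message_from_string(raw))
print("\n".join(tree))
```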
PHP Regular Expressions (REGEX) Multipart MIME (NOT-EMAIL)
Alternatively, you could parse with explode(). This should be much faster, it's not too complex, and it gives you the header info if you want it:
<?php
$body = file_get_contents('output.txt');
$boundary = '__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__';
$parts = explode("--$boundary", $body);
array_shift($parts); # delete up to the first boundary
array_pop($parts); # delete after the last boundary
$binaries = array();
foreach ($parts as $part) {
    list($header, $binary) = explode("\n\n", $part, 2);
    $binaries[] = $binary;
}
print_r($binaries);
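For comparison, the same split-on-boundary idea can be sketched in Python. The function name split_parts and the sample body are made up for the example. It assumes the boundary is already known and that lines end with "\n" rather than "\r\n".

```python
def split_parts(body, boundary):
    """Split a multipart body into (header, content) pairs using plain string splits."""
    chunks = body.split("--" + boundary)
    chunks = chunks[1:-1]  # drop the preamble and whatever follows the closing delimiter
    parts = []
    for chunk in chunks:
        # The first blank line separates the part headers from the content.
        header, _, content = chunk.partition("\n\n")
        parts.append((header.strip(), content))
    return parts


sample_body = (
    "preamble\n"
    "--SPLIT\n"
    "Content-Type: text/plain\n"
    "\n"
    "first\n"
    "--SPLIT\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "second\n"
    "--SPLIT--\n"
)
parts = split_parts(sample_body, "SPLIT")
print(parts)
```

As with the PHP version, this breaks if the boundary string also appears inside a part's content.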
Parsing email headers with regular expressions in python
Your demo text is practically in mbox format, which can be processed with the mbox class in the mailbox module:
from mailbox import mbox
import re

PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")

mymbox = mbox("demo.txt")
for email in mymbox.values():
    # A missing header comes back as None, so fall back to an empty string.
    from_address = PAT_EMAIL.findall(email["from"] or "")
    to_address = PAT_EMAIL.findall(email["to"] or "")
    date = [email["date"] or ""]
    print(";".join(from_address + to_address + date))
Regular expression for matching content between lines in ruby
OK, the solution for this turned out to be pretty simple. I ended up with an expression like the following:
--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036.+?(On \\d{0,2}[\\/\\-]\\d{0,2}[\\/\\-]\\d{0,4}.+?)--Apple-Mail=_EFA7D6C2-C778-4C8E-AA13-C97DF1FA9036
No need to perform a look-ahead/behind for this.
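Here is a rough Python rendering of the same idea, for illustration only. The boundary string and the message text are invented. Note that the dot has to be allowed to match newlines for the pattern to span lines: re.DOTALL in Python, the /m flag in Ruby.

```python
import re

boundary = "--Apple-Mail=_EXAMPLE"  # hypothetical boundary, not from a real message

# Everything between two boundary lines, capturing from the "On dd/mm/yyyy" line on.
pattern = re.compile(
    re.escape(boundary)
    + r".+?(On \d{0,2}[/-]\d{0,2}[/-]\d{0,4}.+?)"
    + re.escape(boundary),
    re.DOTALL,
)

text = "\n".join([
    boundary,
    "Content-Type: text/plain",
    "",
    "Thanks!",
    "On 12/03/2014, someone wrote:",
    "> original text",
    boundary,
])

match = pattern.search(text)
quoted = match.group(1) if match else ""
print(quoted)
```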