Strip Signatures and Replies from Emails

If your system is in-house and/or you have a limited number of reply formats, it's possible to do a pretty good job. Here are the filters we have set up for email responses to Trac tickets:

Drop all text after and including:

  1. Lines that equal '-- \n' (standard email sig delimiter)
  2. Lines that equal '--\n' (people often forget the space in the sig delimiter, and a bare '--' line is not that common outside sigs)
  3. Lines that begin with '-----Original Message-----' (MS Outlook default)
  4. Lines that begin with '________________________________' (32 underscores, Outlook again)
  5. Lines that begin with 'On ' and end with ' wrote:\n' (OS X Mail.app default)
  6. Lines that begin with 'From: ' (failsafe for Outlook and some other reply formats)
  7. Lines that begin with 'Sent from my iPhone'
  8. Lines that begin with 'Sent from my BlackBerry'

Numbers 3 and 4 are 'begin with' instead of 'equals' because users sometimes squash lines together by accident.

We try to be liberal about stripping out replies, since it's much more of an annoyance (to us) to have reply garbage than it is to correct missing text.
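
For anyone who wants to try this, the eight rules above could be sketched roughly as follows in Python. This is not our actual Trac filter code; the function name strip_reply and the constant CUT_PREFIXES are made up for illustration, and it assumes the body is already decoded plain text.

```python
# Rules 3, 4, 6, 7 and 8: lines that *begin with* one of these markers.
CUT_PREFIXES = (
    "-----Original Message-----",   # MS Outlook default
    "_" * 32,                       # 32 underscores, Outlook again
    "From: ",                       # failsafe for Outlook and others
    "Sent from my iPhone",
    "Sent from my BlackBerry",
)


def strip_reply(body: str) -> str:
    """Return the text before the first line that matches any rule."""
    kept = []
    for line in body.splitlines():      # splitlines() drops the line ending
        # Rules 1 and 2: the sig delimiter, with or without the space.
        if line in ("-- ", "--"):
            break
        if line.startswith(CUT_PREFIXES):
            break
        # Rule 5: "On <date>, <someone> wrote:" (Mail.app style).
        if line.startswith("On ") and line.rstrip().endswith(" wrote:"):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

Because everything after the first match is dropped, a reply typed below the quoted text (bottom-posting) would be lost with this sketch, which matches the "be liberal about stripping" trade-off described above.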

Anybody have other formats from the wild that they want to share?

Get the actual email message that the person just wrote, excluding any quoted text

There are many libraries out there that can help you extract the reply/signature from a message:

  • Ruby: https://github.com/github/email_reply_parser
  • Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
  • JavaScript: https://github.com/turt2live/node-email-reply-parser
  • Java: https://github.com/Driftt/EmailReplyParser
  • PHP: https://github.com/willdurand/EmailReplyParser

I've also read that Mailgun has a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: https://www.mailgun.com/blog/handle-incoming-emails-like-a-pro-mailgun-api-2-0/

Hope this helps!

How does Gmail recognize email signatures (alternatively, What's the best way to recognize email signatures?)

Email signatures are supposed to start with a line consisting of two dashes and a space, followed by a newline ('-- \n').
See Wikipedia and RFC 3676.

How to parse an email signature to get the details separately?

I don't think this can be solved in just a few lines of code. It probably requires dedicated processing, such as a signature parser or NLP. This question has been open since August, so I guess it's time to close it.

E-mail trimming

As far as I know, there is no specification for that. So the only way to do it would be to track the message IDs and then diff the messages against each other to find out which parts came from an earlier message.

That is what the Mac Mail client does.

But that does not seem to be what you had in mind.

So, to my knowledge, there is no magic marker in the body for what you are planning.
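
For what it's worth, the diffing idea mentioned above could be sketched like this in Python. It assumes you have already looked up the parent message's body yourself (for example via the In-Reply-To header); the names normalize and new_text are made up for illustration, and I am not claiming this is how any real mail client does it.

```python
import re

# One or more leading ">" quote markers, each optionally followed by a space.
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")


def normalize(line: str) -> str:
    """Remove quote markers and surrounding whitespace from a line."""
    return QUOTE_PREFIX.sub("", line).strip()


def new_text(reply_body: str, parent_body: str) -> str:
    """Keep only the reply lines that do not also occur in the parent."""
    seen = {normalize(line) for line in parent_body.splitlines()}
    seen.discard("")  # blank lines should not count as quoted text
    kept = [line for line in reply_body.splitlines()
            if normalize(line) not in seen]
    return "\n".join(kept).strip()
```

This compares whole lines only, so quoted text that the mail client re-wrapped will not match, and an attribution line such as "On ... wrote:" is left in place because it never appeared in the parent.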

As for the signature:

It is good practice to put a double minus followed by a newline before the signature (the standard delimiter is '-- ', with a trailing space), but you cannot always rely on that.

Is it possible to programmatically 'clean' emails?

In email, there are a couple of agreed-upon markers for things you may wish to strip. You can look for these lines using regular expressions. I doubt you can fully "sanitize" your emails, but here are some things you can look for:

  1. A line starting with "> " (greater-than sign, then whitespace) marks a quote
  2. A line consisting of "-- " (two hyphens, then a space, then a linefeed) marks the beginning of a signature; see "Signature block" on Wikipedia
  3. In multipart messages, boundaries start with "--"; beyond that, you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
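
To make the three points above concrete, here is a rough sketch in Python (not C#) that uses the standard library's email package to pick out the text/plain part and then drops quoted lines and the signature. The function names plain_text_body and clean are made up for illustration. Note that it treats any line starting with ">" as a quote, which is slightly broader than rule 1.

```python
from email import message_from_string


def plain_text_body(raw: str) -> str:
    """Return the first text/plain part, skipping images and attachments."""
    msg = message_from_string(raw)
    for part in msg.walk():             # walks nested multipart containers too
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)   # bytes, base64/QP undone
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def clean(body: str) -> str:
    """Drop quoted lines and everything from the signature delimiter on."""
    kept = []
    for line in body.splitlines():
        if line == "-- ":               # signature starts here
            break
        if line.startswith(">"):        # quoted line
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Usage would be something like clean(plain_text_body(raw_message)), where raw_message is the full message source including headers. Letting the email package handle the multipart boundaries avoids searching for the "--" boundary lines by hand.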

As for an actual C# implementation, I leave that to you or other SOers.

PHP Mail Parsing - Exclude a signature

I would add a unique boundary right below your reply body. So your reply-raw-message would always be something like:

$unique_boundary = md5(time());

$raw_message = $your_reply . $unique_boundary . $the_rest;

Then

$explode = explode($unique_boundary, $raw_message);

and in your DB:

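$reply = $explode[0];
// Use a prepared statement so the reply text cannot break or inject into the query.
// $pdo is assumed to be an existing PDO connection.
$stmt = $pdo->prepare("INSERT INTO email_table (message_content) VALUES (?)");
$stmt->execute([$reply]);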

Based on user feedback, here is a solution that works regardless of where your reply appears in the message chain:

 $raw_message = $start_content . $unique_boundary . $your_reply . $unique_boundary . $the_rest;

In this scenario, where your reply is not necessarily at the top of the email chain, your reply will always be $explode[1].


