Parsing Outlook .Msg Files With Python

This works for me:

import win32com.client

# Requires Windows with Outlook installed (win32com comes from the pywin32 package).
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
msg = outlook.OpenSharedItem(r"C:\test_msg.msg")

print(msg.SenderName)
print(msg.SenderEmailAddress)
print(msg.SentOn)
print(msg.To)
print(msg.CC)
print(msg.BCC)
print(msg.Subject)
print(msg.Body)

# Outlook collections are 1-indexed, hence item + 1.
count_attachments = msg.Attachments.Count
if count_attachments > 0:
    for item in range(count_attachments):
        print(msg.Attachments.Item(item + 1).Filename)

del outlook, msg

Note that the To, CC and BCC properties return display names (e.g. "John Doe"), not email addresses; retrieving the addresses themselves requires other methods, which are covered in a separate post.
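
One way to get at addresses is the item's Recipients collection. The following is only a minimal sketch, assuming msg is the item returned by OpenSharedItem above; the helper name recipient_addresses is hypothetical, and for Exchange accounts the Address property may hold an internal Exchange address rather than an SMTP address.

```python
# Outlook's OlMailRecipientType values: 1 = To, 2 = CC, 3 = BCC.
RECIPIENT_TYPES = {1: "To", 2: "CC", 3: "BCC"}


def recipient_addresses(msg):
    """Group (name, address) pairs of an Outlook item by recipient type.

    Hypothetical helper: `msg` is assumed to expose a Recipients collection
    whose entries have Name, Address and Type properties.
    """
    grouped = {"To": [], "CC": [], "BCC": []}
    for recipient in msg.Recipients:
        kind = RECIPIENT_TYPES.get(recipient.Type)
        if kind is not None:  # ignore other types, e.g. the originator (0)
            grouped[kind].append((recipient.Name, recipient.Address))
    return grouped
```

Calling recipient_addresses(msg)["To"] would then give a list of (display name, address) tuples for the To line.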

How to import .msg files in Python along with attachments from a local directory

Posting the solution that worked for me (as asked by Amey P Naik).
As mentioned, I tried multiple modules, but only extract_msg worked for the case at hand.
I created two functions for importing the Outlook message text and attachments into a Pandas DataFrame. The first function creates one folder per email message; the second imports the message data into a DataFrame. Attachments need to be processed separately, using a for loop over the sub-directories in the parent directory. Below are the two functions, with comments:

# 1). Import the required modules and set up the working directory

import extract_msg
import os
import pandas as pd
direct = os.getcwd()  # directory passed to the functions below; this is where all the .msg files are stored
ext = '.msg'  # type of files in the folder to be read

# 2). Create a separate folder per email and extract the data

def content_extraction(directory, extension):
    for mail in os.listdir(directory):
        try:
            if mail.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, mail))  # creates a local 'msg' object for each email in the directory
                msg.save()  # creates a separate folder for each email inside the parent folder, saves a text file with the email body, and saves all attachments into that folder
        except (UnicodeEncodeError, AttributeError, TypeError):
            pass  # some emails cannot be processed due to different formats, e.g. emails sent from mobile

content_extraction(direct, ext)

# 3). Import the data into a Pandas DataFrame using the extract_msg module
# Note: this does not import data from the sub-folders inside the parent directory;
# it extracts the information from the .msg files themselves. You can use a loop instead
# to import data directly from the files saved in the sub-folders.

def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        try:
            if i.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, i))
                my_list.append([msg.filename, msg.sender, msg.to, msg.date, msg.subject, msg.body, msg.message_id])  # built-in attributes of the extract_msg.Message class; names may differ between extract_msg versions
        except (UnicodeEncodeError, AttributeError, TypeError):
            pass
    # build the DataFrame once, after all messages have been read
    df = pd.DataFrame(my_list, columns=['File Name', 'From', 'To', 'Date', 'Subject', 'MailBody Text', 'Message ID'])
    print(df.shape[0], 'rows imported')
    return df

df = DataImporter(direct, ext)

After running these two functions, you will have almost all of the information in a Pandas DataFrame, which you can use as needed. If you also need to extract content from the attachments, loop over the sub-directories inside the parent directory and read each attachment according to its format; in my case the formats were .pdf, .jpg, .png, .csv, etc. Each format requires a different technique; for example, getting data out of a PDF may need an OCR module such as Pytesseract.
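
The loop over sub-directories could look roughly like the sketch below. It only collects the saved files grouped by extension, so that each format can then be handed to whatever reader suits it; the function name attachments_by_extension is hypothetical, and it assumes the per-email folders sit directly under the parent directory. The saved body text files will show up under '.txt' alongside any real attachments.

```python
import os
from collections import defaultdict


def attachments_by_extension(parent_dir):
    """Map lower-cased file extensions to the files found in parent_dir's sub-folders."""
    parent_abs = os.path.abspath(parent_dir)
    found = defaultdict(list)
    for root, _dirs, files in os.walk(parent_dir):
        if os.path.abspath(root) == parent_abs:
            continue  # skip the .msg files sitting directly in the parent directory
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            found[extension].append(os.path.join(root, name))
    return dict(found)
```

For example, attachments_by_extension(os.getcwd()).get('.csv', []) would list the saved .csv attachments, which could then be read one by one with pd.read_csv.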

If you find an easier way to extract content from attachments, please post your solution here for future reference. If you have any questions, please comment. Suggestions for improving the code above are also welcome.

Reading saved email file with “.msg” extension, in local disk

The code you have now looks to be completely unworkable for what you're trying to accomplish. You need to parse Outlook ".msg" files, which can be done in Python but not using the email module. But if you can use ".eml" files as you mentioned, it will be easier because the email module can read those.

To read .eml files, see email.message_from_file().
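
As a rough illustration of the .eml route, here is a minimal sketch using only the standard library. The function name read_eml and the choice of returned fields are assumptions for the example, not part of the original question.

```python
import email
from email import policy


def read_eml(path):
    """Parse an .eml file and return a few header fields plus the plain-text body."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # policy.default yields an EmailMessage, which provides get_body()
        message = email.message_from_file(handle, policy=policy.default)
    body_part = message.get_body(preferencelist=("plain",))
    return {
        "from": message["From"],
        "to": message["To"],
        "subject": message["Subject"],
        "body": body_part.get_content() if body_part is not None else "",
    }
```

If the files contain non-UTF-8 bytes, opening them in binary mode and using email.message_from_binary_file() instead may be the safer choice.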

Export multiple Outlook .MSG files/datapoints to .CSV using extract-msg Python module

The following adds a loop that opens every .msg file in the current directory and writes the output to a single CSV file:

import os
import extract_msg
import csv

# newline='' prevents blank rows between records on Windows
with open(r'Email.csv', mode='w', newline='') as file:
    fieldnames = ['Subject', 'Date', 'Sender']
    writer = csv.DictWriter(file, fieldnames=fieldnames)

    writer.writeheader()

    for f in os.listdir('.'):
        if not f.endswith('.msg'):
            continue

        msg = extract_msg.Message(f)
        msg_sender = msg.sender
        msg_date = msg.date
        msg_subj = msg.subject
        msg_message = msg.body  # read here but not written to the CSV

        writer.writerow({'Subject': msg_subj, 'Date': msg_date, 'Sender': msg_sender})

