Extract Email Sub-Strings from Large Document

This code extracts an email address from a string. Use it while reading the document line by line:

>>> import re
>>> line = "should we use regex more often? let me know at jdsk@bob.com.lol"
>>> match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk@bob.com.lol'

If a line contains several email addresses, use findall:

>>> line = "should we use regex more often? let me know at  jdsk@bob.com.lol or popop@coco.com"
>>> match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match
['jdsk@bob.com.lol', 'popop@coco.com']
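
For a large document, the same findall call can be applied to each line as it is read, so the whole file never has to sit in memory. The following is a minimal sketch under that assumption; the name iter_emails is hypothetical, and io.StringIO only stands in for a real file object such as the one returned by open(path).

```python
import io
import re

# Same pattern as in the examples above, compiled once for reuse.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def iter_emails(lines):
    """Yield every match found in an iterable of lines (e.g. an open file)."""
    for line in lines:
        yield from EMAIL_RE.findall(line)

# io.StringIO is a stand-in for a large file opened with open(path).
sample = io.StringIO(
    "first line, write to jdsk@bob.com.lol\n"
    "nothing to find here\n"
    "or try popop@coco.com\n"
)
emails = list(iter_emails(sample))
print(emails)  # ['jdsk@bob.com.lol', 'popop@coco.com']
```

Because a file object yields one line at a time, passing it directly to iter_emails keeps memory use independent of the file size.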

The regex above probably matches the most common real-world email addresses. If you need to be fully aligned with RFC 5322, check which email addresses the specification actually allows, to avoid bugs in finding email addresses correctly.


Edit: as suggested in a comment by @kostek:
In the string "Contact us at support@example.com." my regex returns support@example.com. (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+

Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture example@do-main.com as well.

Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad@ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
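
Note that the pattern used in the examples at the top, [\w.+-]+@[\w-]+\.[\w.-]+, allows dots in its last segment, so it can still pick up a sentence-ending period. One possible workaround, offered as a sketch rather than as part of the original answer, is to strip trailing dots from each match; the helper name find_emails is hypothetical.

```python
import re

# Pattern from the examples at the top of this answer.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def find_emails(text):
    """Return all matches, with any trailing dots removed (assumed post-processing)."""
    return [match.rstrip('.') for match in EMAIL_RE.findall(text)]

print(find_emails("Contact us at support@example.com."))  # ['support@example.com']
print(find_emails("this is bad@ss and has no dot"))       # []
```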

How to scrape valid emails from a file using Regex in Python?

In regex, the part [gmail|hotmail|outlook]+ is a character class, so it essentially means 'match one or more of any of these characters: g, m, a, i, l, h, o, t, k, u, |'. What you need is a non-capturing group (?:...), like this: r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@(?:gmail|hotmail|outlook)\.com'. And because an unescaped . in .com means 'any character followed by com', you need to escape it with \.
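
The difference is easy to see side by side. In this sketch the local part is simplified to [\w.-]+ and the sample addresses are made up for illustration; tomato.com is chosen because every letter of "tomato" happens to appear in the character class.

```python
import re

text = "write to alice@gmail.com, bob@hotmail.com or carol@tomato.com"

# Character class: any run of the listed characters is accepted as a domain.
char_class_re = re.compile(r'[\w.-]+@[gmail|hotmail|outlook]+\.com')

# Non-capturing group: only the three whole words are accepted.
group_re = re.compile(r'[\w.-]+@(?:gmail|hotmail|outlook)\.com')

class_hits = char_class_re.findall(text)
group_hits = group_re.findall(text)

print(class_hits)  # ['alice@gmail.com', 'bob@hotmail.com', 'carol@tomato.com']
print(group_hits)  # ['alice@gmail.com', 'bob@hotmail.com']
```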

regex extract email from strings

You can create a function with the regex /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/ to extract email addresses from a long text:

function extractEmails(text) {
  // With the g flag, match() returns an array of all matches, or null if there are none.
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
}

Script in action:

var text = `boleh di kirim ke email saya ekoprasetyo.crb@outlook.com tks... boleh minta kirim ke db.maulana@gmail.com. dee.wien@yahoo.com. . 
deninainggolan@yahoo.co.id Senior Quantity Surveyor
Fajar.rohita@hotmail.com, terimakasih bu Cindy Hartanto
firmansyah1404@gmail.com saya mau dong bu cindy
fransiscajw@gmail.com
Hi Cindy ...pls share the Salary guide to donny_tri_wardono@yahoo.co.id thank a`;

function extractEmails(text) {
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
}

$("#emails").text(extractEmails(text));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<p id="emails"></p>

Finding first index after symbol

Here is a less conventional approach that does not use regular expressions: reverse the string.

s = 'Application for training - customer@gmail.com Some notes'
s_rev = s[::-1]

# Now you are looking for "moc." and this is the starting point:
s_rev.find("moc.")
# -> 11

# Then you can search for the next space after this index:
s_rev.find(" ", 11)
# -> 29

# Then you can take the email from the reversed string:
s_rev[11:29]
# -> 'moc.liamg@remotsuc'

# Finally reverse it back:
s_rev[11:29][::-1]
# -> 'customer@gmail.com'

As a one-liner:

s[::-1][s[::-1].find("moc."):s[::-1].find(" ", s[::-1].find("moc."))][::-1]

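Note that the second find looks for the space that follows the address in the reversed string, which is the space before the address in the original string, as in the example you gave. If the string ends with the email address, nothing changes: the first find simply returns 0. If the string starts with the email address, however, the second find returns -1, and a slice end of -1 stops one character short of the end of the reversed string, so the first character of the address is lost; use len(s_rev) as the end index in that case. The approach also assumes the address ends in .com, and it picks up any non-space characters attached to the front of the address (e.g., an opening parenthesis).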
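
The steps above can be wrapped in a small function. This is a minimal sketch, not a validated email parser: the name last_dot_com_token is hypothetical, and it handles a -1 result from either find explicitly, because a slice end of -1 would cut off the final character of the reversed string.

```python
def last_dot_com_token(s):
    """Return the last space-delimited token ending in '.com', or None if absent."""
    s_rev = s[::-1]
    start = s_rev.find("moc.")
    if start == -1:
        return None  # no ".com" anywhere in the string
    end = s_rev.find(" ", start)
    if end == -1:
        end = len(s_rev)  # the token runs to the start of the original string
    return s_rev[start:end][::-1]

print(last_dot_com_token('Application for training - customer@gmail.com Some notes'))
# customer@gmail.com
print(last_dot_com_token('customer@gmail.com Some notes'))
# customer@gmail.com
```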


