Extract All Email Addresses from Some .Txt Documents Using Ruby

Depending on the nature of your .txt documents, you don't have to use one of the complicated regexes that attempt to validate email addresses. You're not trying to validate anything. You're just trying to grab what's already there. Generally speaking, a regex to grab what's already there can be much simpler than a regex that needs to validate input.

An important question is whether your .txt documents contain @ signs that are not part of an email address you want to extract.

This regex handles your first two requirements:

\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+

Or if you want to allow any sequence of non-space characters containing an @ sign, plus your second requirement (which has spaces):

\S+@\S+|\{(?:\w+, *)+\w+\}@[\w.-]+
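
In Ruby, a pattern like this can be passed straight to String#scan. The following is only a minimal sketch: it assumes plain addresses (just the first alternative of the first regex) and a flat folder of .txt files, and the names EMAIL_RE, extract_emails and extract_emails_from_dir are made up for illustration.

```ruby
# Simple "grab what's already there" pattern: word characters, an @ sign,
# then word characters, dots or hyphens. It does not validate anything.
EMAIL_RE = /\w+@[\w.-]+/

# Returns every substring of +text+ that looks like an address.
def extract_emails(text)
  text.scan(EMAIL_RE)
end

# Hypothetical helper: unique addresses found in every .txt file
# directly inside +dir+ (sub-folders are not searched).
def extract_emails_from_dir(dir)
  Dir.glob(File.join(dir, "*.txt"))
     .flat_map { |path| extract_emails(File.read(path)) }
     .uniq
end

sample = "Contact bob@example.com or alice_1@mail.example.org today"
p extract_emails(sample)
#=> ["bob@example.com", "alice_1@mail.example.org"]
```

Because the domain part allows dots, an address at the very end of a sentence will pick up the trailing full stop; strip it afterwards if your documents look like that.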

Extract email addresses from a block of text

How about this for a (slightly) better regular expression? It uses uppercase-only character classes, so apply it with a case-insensitive flag:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

You can find this here:

Email Regex

Just an FYI, the problem with your regex is that it allows only one type of separator before or after an email address. It would also match "@" alone, if separated by spaces.
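
For anyone trying the pattern above in Ruby, here is a small sketch. The /i flag and the constant name BLOCK_EMAIL_RE are my additions, and the sample text is invented; the original question may well have been about another language.

```ruby
# The quoted pattern with Ruby's /i flag, so the uppercase-only
# character classes also match lowercase letters.
BLOCK_EMAIL_RE = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

text = "Write to John.Doe@Example.com, or to jane_roe+x@mail.example.org. " \
       "A lone @ is ignored."

# scan returns every non-overlapping match as an array of strings.
p text.scan(BLOCK_EMAIL_RE)
#=> ["John.Doe@Example.com", "jane_roe+x@mail.example.org"]
```

The word boundaries mean surrounding commas and full stops are left out, and a bare "@" never matches because at least one character is required on each side. Note that the {2,4} quantifier will generally miss or truncate addresses whose top-level domain is longer than four letters.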

R gsub to extract emails from text

We can try str_extract() from the stringr package:

str_extract(text, "\\S*@\\S*")

[1] "Saolonm@hotmail.com"              
[2] "26.leonard@gmail.com"             
[3] "jcdavola31@gmail.com"             
[4] "andrescarnederes@headset.cl"      
[5] "luciana.chavela.ecuador@gmail.com"

where \\S* matches any number of non-space characters.

Regex to extract method names from file using ruby

str =<<_
@Test(priority =10)
@RunAsClient
public void metodName(){
   ...
}

   ...

@Test(priority =     20)
public void otherMethodName(){
   ...
}
...
_

r = /
    \@Test\(\s*priority\s*=\s*\d+\s*\)\s*\n  # Match string
    (?:\@RunAsClient\n)?  # Optionally match string
    (?:\w+\s+)+           # Match (word, >= 1 spaces) >= 1 times
    \K                    # Forget everything matched so far
    \w+                   # match word
    (?=                   # begin positive lookahead
      (?:\([^)]*\)\s*\{)  # match paren-enclosed expression, >= 0 spaces, {
      |                   # or
      (?:\s*\{)           # match >= 0 spaces, {
    )                     # end positive lookahead
    /x                    # extended/free-spacing regex definition mode

str.scan r
  #=> ["metodName", "otherMethodName"]

Trying to open / access the text in an attachment to an email using ruby gems

The text file attachment will be Base64 encoded. So you should be able to just decode it like this.

current_mail.attachments.each { |a| puts a.decode_body }
# prints "can we see this?"
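
To see what the decoding step amounts to, here is a sketch using core Ruby only. It is not the mail gem's API; encode_base64 and decode_base64 are hypothetical helpers, and the string is just the one from the output above.

```ruby
# "m" is the Base64 directive for Array#pack and String#unpack1.
def encode_base64(text)
  [text].pack("m0") # "m0" omits the line breaks that plain "m" adds
end

def decode_base64(encoded)
  encoded.unpack1("m")
end

encoded = encode_base64("can we see this?")
puts encoded                  # the Base64 text as it would travel in the mail
puts decode_base64(encoded)   # prints: can we see this?
```

A mail library does this for you based on the part's Content-Transfer-Encoding header, so in practice you should not need to decode by hand.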

Select most common domain from list of email addresses

email_list = 10.times.map { emails }
  #=> ["alfred.grass426@gmail.com", "elisa.oak239@icloud.com",
  #    "daniel.fruit1600@outlook.com", "ana.fruit3761@icloud.com",
  #    "daniel.grass742@yahoo.com", "elisa.oak3891@outlook.com",
  #    "alfred.leaf1321@gmail.com", "alfred.grass5295@outlook.com",
  #    "ramzes.fruit435@gmail.com", "ana.fruit4233@yahoo.com"] 

email_list.group_by { |s| s[/@\K.+/] }.max_by { |_,v| v.size }.first
  #=> "gmail.com"

\K in the regex means disregard everything matched so far. Alternatively, @\K could be replaced by the positive lookbehind (?<=@).

The steps are as follows.

h = email_list.group_by { |s| s[/@\K.+/] }
  #=> {"gmail.com"  =>["alfred.grass426@gmail.com", "alfred.leaf1321@gmail.com",
  #                    "ramzes.fruit435@gmail.com"],
  #    "icloud.com" =>["elisa.oak239@icloud.com", "ana.fruit3761@icloud.com"],
  #    "outlook.com"=>["daniel.fruit1600@outlook.com",  "elisa.oak3891@outlook.com",
  #                    "alfred.grass5295@outlook.com"],
  #    "yahoo.com"  =>["daniel.grass742@yahoo.com", "ana.fruit4233@yahoo.com"]}
a = h.max_by { |_,v| v.size }
  #=> ["gmail.com", ["alfred.grass426@gmail.com", "alfred.leaf1321@gmail.com",
  #                  "ramzes.fruit435@gmail.com"]] 
a.first
  #=> "gmail.com"

If, as here, there is a tie for most frequent, modify the code as follows to get all winners.

h = email_list.group_by { |s| s[/@\K.+/] }
  # (same as above)
mx_size = h.map { |_,v| v.size }.max
  #=> 3 
h.select { |_,v| v.size == mx_size }.keys
  #=> ["gmail.com", "outlook.com"]
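
As an alternative sketch, Enumerable#tally (Ruby 2.7 or later) counts the domains directly. The list is the sample from above typed out as a literal, since the emails helper is not shown; EMAIL_SAMPLE and most_common_domains are names I made up.

```ruby
EMAIL_SAMPLE = [
  "alfred.grass426@gmail.com", "elisa.oak239@icloud.com",
  "daniel.fruit1600@outlook.com", "ana.fruit3761@icloud.com",
  "daniel.grass742@yahoo.com", "elisa.oak3891@outlook.com",
  "alfred.leaf1321@gmail.com", "alfred.grass5295@outlook.com",
  "ramzes.fruit435@gmail.com", "ana.fruit4233@yahoo.com"
].freeze

# Returns every domain that shares the highest count (all winners on a tie).
def most_common_domains(emails)
  counts = emails.map { |s| s[/@\K.+/] }.tally # {"gmail.com"=>3, ...}
  top = counts.values.max
  counts.select { |_, n| n == top }.keys
end

p most_common_domains(EMAIL_SAMPLE)
#=> ["gmail.com", "outlook.com"]
```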

Will this regex for email work for all emails?

Short answer: no. Not all email addresses can be checked by a regex. There's a thread somewhere here on SO that explains this much better than I could if I attempted it. I think the only way to check whether an email address is real is to contact the mail server and enquire whether the user account exists.

Please have a read here: https://stackoverflow.com/a/1373724/81520
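
If a rough syntax check is all you need, Ruby's standard library ships a pattern, URI::MailTo::EMAIL_REGEXP. This is only a sketch of a plausibility test; plausible_email? is a hypothetical helper name, and passing it says nothing about whether the mailbox exists.

```ruby
require "uri"

# True when +address+ is syntactically plausible according to the
# pattern bundled with Ruby's uri library. No mail server is contacted.
def plausible_email?(address)
  address.match?(URI::MailTo::EMAIL_REGEXP)
end

p plausible_email?("user@example.com") #=> true
p plausible_email?("not an address")   #=> false
```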

Export entire html table to a text document using Watir

To get HTML of the entire table (if it is the only table on the page):

browser.table.html

You will get something like this:

=> "<table border=\"1\" cellpadding=\"2\">\n<tbody><tr>\n<th> Address </th>\n<th> Council tax band </th>\n<th> Annual council tax </th>\n</tr>\n\n<tr>\n<td> 2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ </td>\n<td align=\"center\"> F </td>\n<td align=\"center\"> £2125 </td>\n</tr>\n\n</tbody></table>"

To get HTML of each row and put it in an array:

browser.table.trs.collect {|tr| tr.html}

=> ["<tr>\n<th> Address </th>\n<th> Council tax band </th>\n<th> Annual council tax </th>\n</tr>",
    "<tr>\n<td> 2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ </td>\n<td align=\"center\"> F </td>\n<td align=\"center\"> £2125 </td>\n</tr>"]

To get text of each cell and put it in an array:

browser.table.trs.collect {|tr| [tr[0].text, tr[1].text, tr[2].text]}
=> [["Address", "Council tax band", "Annual council tax"],
    ["2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ", "F", "£2125"]]

To write text of each cell to file:

content = browser.table.trs.collect {|tr| [tr[0].text, tr[1].text, tr[2].text]}
File.open("table.txt", "w") {|file| file.puts content}

The file will look like this:

Address
Council tax band
Annual council tax
2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ
F
£2125
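
Each cell ends up on its own line because puts flattens nested arrays. If you would rather keep one line per table row, a sketch like the following joins the cells first. The rows literal is a stand-in for the array returned by Watir (so no browser is needed here), and rows_to_lines is a hypothetical helper.

```ruby
# Stand-in for the nested array produced by the Watir call above.
rows = [
  ["Address", "Council tax band", "Annual council tax"],
  ["2, STONELEIGH AVENUE, COVENTRY, CV5 6BZ", "F", "£2125"]
]

# One string per table row, cells separated by tabs.
def rows_to_lines(rows)
  rows.map { |cells| cells.join("\t") }
end

# Prints two lines, one per row.
puts rows_to_lines(rows)
```

Passing the result to file.puts inside the File.open block then writes one row per line.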

Get names of all files from a folder with Ruby

You also have the shortcut option of

Dir["/path/to/search/*"]

and if you want to find all Ruby files in any folder or sub-folder:

Dir["/path/to/search/**/*.rb"]
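
Dir[] returns full paths, and a bare * can also match directories. If you only want the file names, a sketch along these lines may help; ruby_file_names is a hypothetical helper and the temporary folder exists only to make the example runnable.

```ruby
require "tmpdir"
require "fileutils"

# Names (without directories) of all .rb files under +root+, at any depth.
def ruby_file_names(root)
  Dir[File.join(root, "**", "*.rb")]
    .select { |path| File.file?(path) }   # skip directories that match
    .map { |path| File.basename(path) }
    .sort
end

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "lib"))
  FileUtils.touch(File.join(root, "a.rb"))
  FileUtils.touch(File.join(root, "lib", "b.rb"))
  FileUtils.touch(File.join(root, "notes.txt"))
  p ruby_file_names(root)
  #=> ["a.rb", "b.rb"]
end
```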
