Extract Text Name from String

Extract text name from String

Since you're trying to use stringr, I recommend str_extract (I'd recommend it even if you weren't trying to use stringr):

library(stringr)
x <- c('RED LOBSTER CA04606', 'Red Lobster NewYork WY245')
str_extract(x, '[a-zA-Z ]+\\b')
# [1] "RED LOBSTER " "Red Lobster NewYork "

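The \b in the regex (written '\\b' inside an R string) is a word boundary. It prevents the 'CA' from 'CA04606' being extracted, because there is no word boundary between a letter and the digit that follows it.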

If you don't like that trailing space you could use str_trim to remove it, or you could modify the regex:

str_extract(x, '[a-zA-Z]+(?: +[a-zA-Z]+)*\\b')
# [1] "RED LOBSTER" "Red Lobster NewYork"

Note: if your string has letters after the post code, str_extract returns only the words before it, because it stops at the first match. So in the example below, if you also want the 'NewYork' after the 'WY245', you can use str_extract_all and paste the results together:

x <- c(x, 'Red Lobster WY245 NewYork')
str_extract_all(x, '[a-zA-Z]+(?: +[a-zA-Z]+)*\\b')
# [[1]]
# [1] "RED LOBSTER"
#
# [[2]]
# [1] "Red Lobster NewYork"
#
# [[3]]
# [1] "Red Lobster" "NewYork"

# Paste the bits together with paste(..., collapse=' ')
sapply(str_extract_all(x, '[a-zA-Z]+(?: +[a-zA-Z]+)*\\b'), paste, collapse=' ')
# [1] "RED LOBSTER" "Red Lobster NewYork" "Red Lobster NewYork"
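For readers working outside R, the same idea can be sketched with Python's re module. This is only an illustrative translation, not part of the stringr answer: the helper name extract_name is hypothetical, and the sample strings are modeled on the ones above.

```python
import re

# Same pattern as the stringr examples: runs of letters separated by spaces,
# ending at a word boundary so the 'CA' in 'CA04606' is never picked up.
NAME_PATTERN = re.compile(r'[a-zA-Z]+(?: +[a-zA-Z]+)*\b')


def extract_name(text):
    """Join every letters-only chunk found in `text` with single spaces."""
    return ' '.join(NAME_PATTERN.findall(text))


samples = [
    'RED LOBSTER CA04606',
    'Red Lobster NewYork WY245',
    'Red Lobster WY245 NewYork',
]
names = [extract_name(s) for s in samples]
print(names)
```

Here re.findall plays the role of str_extract_all, and ' '.join does the job of paste(..., collapse=' ').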

Extracting part of name from string

String operations are a real pain in SQL Server. I recommend using APPLY to define intermediate values, if you have to use them:

select v1.lastname, rest,
       (case when v1.rest like '%,%'
             then left(rest, charindex(',', rest) - 1)
             else left(rest, len(rest) - charindex(' ', reverse(rest)))
        end) as firstname,
       (case when v1.rest like '%,%'
             then stuff(rest, 1, charindex(',', rest) + 1, '')
             else stuff(rest, 1, len(rest) - charindex(' ', reverse(rest)) + 1, '')
        end) as lastname
from t cross apply
     (values (left(names, charindex(',', names) - 1), stuff(names, 1, charindex(',', names) + 1, ''))
     ) v1(lastname, rest);

Here is a db<>fiddle.
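To make the two-step logic of the query easier to follow, here is a rough Python sketch of the same split. It is not the SQL Server code itself: the function name split_name, the name remainder for the final piece, and the sample values are all hypothetical, and the sketch assumes every comma is followed by exactly one space.

```python
def split_name(names):
    """Mirror the query's two-step split on a 'Last, rest' string."""
    # Step 1 (the CROSS APPLY): the text before the first comma, and the rest.
    lastname, _, rest = names.partition(', ')
    # Step 2 (the CASE expressions): split `rest` at its comma if it has one,
    # otherwise at its last space.
    if ',' in rest:
        firstname, _, remainder = rest.partition(', ')
    elif ' ' in rest:
        firstname, _, remainder = rest.rpartition(' ')
    else:
        # No comma and no space: everything left over is the first piece.
        firstname, remainder = rest, ''
    return lastname, firstname, remainder


# Hypothetical inputs, one for each branch of the CASE expressions.
print(split_name('Simpson, Homer Jay'))
print(split_name('Simpson, Homer, Jay'))
```

Defining rest once and reusing it is the point of the APPLY in the query: the intermediate value is computed in one place instead of being repeated in every expression.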

How can I extract text from string in python?

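Use sent_tokenize from nltk.tokenize. If NLTK reports a missing tokenizer resource, download it once with nltk.download (the resource is named 'punkt', or 'punkt_tab' in newer releases).

import nltk
sentences = nltk.sent_tokenize(txt)  # txt is the string you want to split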

This will give you a list of sentences.

Text Analysis: extracting person name and quotation: how to create a pattern

The first pattern is all right. It matches "Homer Simpson" as well as "here we" but since you only return group 0 this is fine.

The second pattern has some issues. Since you open the string with " and use the same " inside the string, Python thinks the string ends there. You can see this in the syntax highlighting: the characters change from green (string) to black (not a string) and back to green.

quote_pattern = "["]\w+["]"

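You can prevent this by starting (and ending) your string with single quotation marks ' instead. Adding the r prefix makes it a raw string, so the backslash in \w reaches the regex engine unchanged:

quote_pattern = r'["]\w+["]'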

However, this still does not match the provided quote. This is because \w matches only word characters (letters, digits, and the underscore); it does not match the comma, the period, or the whitespace in the quote.
Therefore you could change the pattern to

quote_pattern ='["].*["]'

Here .* matches any sequence of characters (except newlines, by default).
You can further simplify the expression by removing the square brackets. They are not needed in this case since each pair contains only one character.

quote_pattern ='".*"'

You need to return the quote without the surrounding quotation marks. To do that, create a capture group in the expression using ():

quote_pattern ='"(.*)"'

This way the quotation marks are still required for a match, but the group that is created does not contain them. This group has index 1 instead of the 0 you use at the moment:

extracted_quotation = quote.group(1) 

This should lead to the desired result.
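Putting the steps together, a minimal runnable sketch might look like this. The sample line is hypothetical, since the text from the original question is not reproduced here.

```python
import re

# Hypothetical sample line; the original question's text is not shown here.
line = 'Homer Simpson said: "Okay, here we go."'

quote_pattern = '"(.*)"'
quote = re.search(quote_pattern, line)

# group(0) is the whole match, quotation marks included;
# group(1) is only what the parentheses captured.
whole_match = quote.group(0)
extracted_quotation = quote.group(1)
print(whole_match)
print(extracted_quotation)
```

Keep in mind that .* is greedy: if a line contains several quoted passages, the match runs from the first quotation mark to the last one. In that case the lazy form '"(.*?)"' together with re.findall would return each quotation separately.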

Check out this website for some interactive regex action: https://regex101.com/

Extract user name and email from text file using powershell

Ok, I'll break down my comment. Code in my comment:

gci $path\*.eml|%{gc $_ -raw|?{$_ -match '(?ms)BILLING ADDRESS\s+(\S.+?)[\r\n].+?[\r\n](\S+@\S+)'}|%{[pscustomobject]@{FirstName=$Matches[1].split(' ')[0];LastName=$Matches[1].Split(' ')[-1];Email=$Matches[2]}}

I'll start with defining the various aliases that I used to keep it short in the comment:

gci -> Get-ChildItem
% -> ForEach-Object
gc -> Get-Content
? -> Where-Object

Formatted a little more nicely, and without the aliases, it would look like this:

Get-ChildItem $path\*.eml|
    ForEach-Object{
        Get-Content $_ -raw |
            Where-Object{$_ -match '(?ms)BILLING ADDRESS\s+(\S.+?)[\r\n].+?[\r\n](\S+@\S+)'}|
            ForEach-Object{
                [pscustomobject]@{
                    FirstName=$Matches[1].split(' ')[0];
                    LastName=$Matches[1].Split(' ')[-1];
                    Email=$Matches[2]
                }
            }
    }

This starts out with Get-ChildItem, and that's just searching for *.eml at the path defined in $path. Nothing super complex there, moving on.

Next we hit a ForEach-Object loop. I actually run two loops here, one nested inside the other, so for the outer loop we are concerned with processing files one at a time as they're found. So, for each file the first thing it does is:

Get-Content $_ -raw

That command gets the content of the file that was passed to it as a multi-line string. This allows us to do a single search matching multiple groups against the entire email at once, which is what we do in the next part:

Where-Object{$_ -match '(?ms)BILLING ADDRESS\s+(\S.+?)[\r\n].+?[\r\n](\S+@\S+)'}

This says we only want emails that match the specified RegEx pattern. I'll let you see how RegEx 101 breaks it down if you need the RegEx (Regular Expression) explained. The match has two capturing groups in it, and a successful -match stores them in the automatic $Matches variable, which the following ForEach-Object loop can then read. $Matches is a hashtable keyed by group number: key 0 holds the entire matched string, and each capturing group gets the next number. In our case with your given example that would be:

$Matches[0]
BILLING ADDRESS

Joe Some Blow
123 Nowhere
Someplace, TX 75075
joeblow@nowhere.org

$Matches[1]
Joe Some Blow

$Matches[2]
joeblow@nowhere.org

Then I just loop through the results of that to utilize the $Matches results, and build an object for each result.

ForEach-Object{
    [pscustomobject]@{
        FirstName=$Matches[1].split(' ')[0];
        LastName=$Matches[1].Split(' ')[-1];
        Email=$Matches[2]
    }
}

In that I use the first capture group (Joe Some Blow), split it on the spaces with .split(' '), and use the first result of the split for the first name, and the last result for the last name. I grab the second capturing group for the email address. Then it's just a last } (which is missing in my comment) to close the outer ForEach-Object loop.
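If you want to experiment with the pattern outside PowerShell, the same match-and-split steps can be sketched in Python. This is only an illustration: the sample text is the example shown above, Unix line endings are assumed, and the dictionary stands in for the pscustomobject.

```python
import re

# Sample email body taken from the example in this answer.
email_text = (
    "BILLING ADDRESS\n"
    "\n"
    "Joe Some Blow\n"
    "123 Nowhere\n"
    "Someplace, TX 75075\n"
    "joeblow@nowhere.org\n"
)

# Same pattern as the PowerShell version; (?ms) turns on multiline mode
# and lets the dot match newlines as well.
pattern = re.compile(
    r'(?ms)BILLING ADDRESS\s+(\S.+?)[\r\n].+?[\r\n](\S+@\S+)'
)

match = pattern.search(email_text)
if match:
    # Group 1 is the full name line, group 2 is the email address.
    name_parts = match.group(1).split(' ')
    record = {
        'FirstName': name_parts[0],
        'LastName': name_parts[-1],
        'Email': match.group(2),
    }
    print(record)
```

As in the PowerShell version, a middle name is simply dropped, because only the first and last pieces of the split are used.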


