How to Extract All Uppercase Letters from a String - Python

How to extract all uppercase letters from a string in Python?

Using list comprehension:

>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> ''.join([c for c in s if c.isupper()])
'ABCDEFGHIJKLMNOP'

Using generator expression:

>>> ''.join(c for c in s if c.isupper())
'ABCDEFGHIJKLMNOP'

You can also do it using regular expressions:

>>> import re
>>> re.sub('[^A-Z]', '', s)
'ABCDEFGHIJKLMNOP'
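
One difference between the approaches above is worth keeping in mind: str.isupper() is Unicode-aware, while the character class [A-Z] only covers ASCII letters. The sketch below illustrates this; the function names and the sample string are made up for the example.

```python
import re

def upper_unicode(text):
    """Keep every character for which str.isupper() is true (Unicode-aware)."""
    return ''.join(filter(str.isupper, text))

def upper_ascii(text):
    """Keep only the ASCII letters A-Z, as the regex approach does."""
    return re.sub(r'[^A-Z]', '', text)

sample = 'aÉbC'  # hypothetical input containing a non-ASCII uppercase letter
print(upper_unicode(sample))  # ÉC
print(upper_ascii(sample))    # C
```

For plain ASCII input such as the string s above, both functions return the same result.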

Extracting runs of consecutive uppercase words from a string

You can use itertools.groupby:

import itertools
s = "A B c de F G A"
new_s = [' '.join(b) for a, b in itertools.groupby(s.split(), key=str.isupper) if a]

Output:

['A B', 'F G A']
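
To see why this works, it can help to look at what groupby produces before the filtering step. A minimal sketch on the same input (the variable names runs and upper_runs are only for illustration):

```python
import itertools

s = "A B c de F G A"

# groupby yields one (key, group) pair per run of consecutive tokens
# that share the same key, here the result of str.isupper
runs = [(is_upper, list(tokens))
        for is_upper, tokens in itertools.groupby(s.split(), key=str.isupper)]
print(runs)
# [(True, ['A', 'B']), (False, ['c', 'de']), (True, ['F', 'G', 'A'])]

# keeping only the runs whose key is True gives the uppercase groups
upper_runs = [' '.join(tokens) for is_upper, tokens in runs if is_upper]
print(upper_runs)
# ['A B', 'F G A']
```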

How to extract only uppercase substring from pandas series?

How about:

 df['feat'] = df.col.str.extract('([A-Z_]+)').fillna('')

Output:

                                 col           feat
0                                cat
1                 cat.COUNT(example)          COUNT
2  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON
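
str.extract returns the capture group from the first match in each string. The same behaviour can be checked per string with the standard re module; this is only a sketch, and the helper name first_upper_run is made up here.

```python
import re

pattern = re.compile(r'([A-Z_]+)')

def first_upper_run(text):
    """Return the first run of A-Z/underscore characters, or '' if there is none."""
    match = pattern.search(text)
    return match.group(1) if match else ''

items = ['cat', 'cat.COUNT(example)', 'cat.N_MOST_COMMON(example.ord)[2]']
feats = [first_upper_run(item) for item in items]
print(feats)
# ['', 'COUNT', 'N_MOST_COMMON']
```

Because the underscore is part of the character class, a lowercase string that contains an underscore, such as 'a_b', would yield '_'.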

How to extract uppercase and title case string sections into separate columns

  • Use pandas.Series.str.extractall, which returns each capture group in the regex pattern as its own column.
  • The pattern may also capture trailing whitespace, which must be removed with .str.strip()
    • Without strip: df.iloc[0, 2] → 'DREGGHE '

import pandas as pd

# sample dataframe
data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)

# extract names
df[['First Name','Last Name']] = df['Naam aanvrager'].str.extractall(r'(\b[A-Z ]+\b)(\w+)*').reset_index()[[1,0]]

# the pattern to extract the Last Name may include extra whitespace, which can be removed as follows
df['Last Name'] = df['Last Name'].str.strip()

# display(df)
       Naam aanvrager  First Name      Last Name
0     DREGGHE Joannes     Joannes        DREGGHE
1          MAHIEU Leo         Leo         MAHIEU
2       NIEUWENHUIJSE         NaN  NIEUWENHUIJSE
3             COPPENS         NaN        COPPENS
4  VERBURGHT Cornelis    Cornelis      VERBURGHT
5    NUYTTENS Adriaen     Adriaen       NUYTTENS
6  DE LARUELLE Pieter      Pieter    DE LARUELLE
7          VAN VIJVER         NaN     VAN VIJVER
8      SILBO Martinus    Martinus          SILBO
9   STEEMAERE Anthone     Anthone      STEEMAERE
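
To check what the two capture groups contain for a single name, including the trailing whitespace that strip removes, the same pattern can be tried with the standard re module. This is a sketch only; split_name is a hypothetical helper, not part of the answer above.

```python
import re

pattern = re.compile(r'(\b[A-Z ]+\b)(\w+)*')

def split_name(full_name):
    """Return (first_name, last_name); first_name is None when absent."""
    match = pattern.match(full_name)
    if match is None:
        return None, None
    last, first = match.group(1), match.group(2)
    return first, last.strip()

# group 1 keeps the space that separates the two parts of the name
print(repr(pattern.match('DREGGHE Joannes').group(1)))  # 'DREGGHE '

print(split_name('DREGGHE Joannes'))     # ('Joannes', 'DREGGHE')
print(split_name('DE LARUELLE Pieter'))  # ('Pieter', 'DE LARUELLE')
print(split_name('COPPENS'))             # (None, 'COPPENS')
```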

How to extract the uppercase as well as some substring from pandas dataframe using extract?

For feat (agg was already answered in your other Stack Overflow question), you can extract two series from two patterns separated by |, and then fillna() one series with the other.

  1. ^([^A-Z]*$) returns the full string only if it contains no uppercase letters.
  2. [^a-z].*example\.([a-z]+)\).*$ returns the lowercase text between example. and ). In practice it only applies when the string contains an uppercase letter, because otherwise the first pattern has already matched.


import pandas as pd

df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

# raw string, so that \. and \) reach the regex engine unchanged
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[1]:
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num
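
To see which of the two alternatives fires for a given string, and therefore why the fillna() calls are needed, the pattern can be tried with the standard re module. A rough sketch; matched_groups is a name invented for this example.

```python
import re

pattern = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$')

def matched_groups(item):
    """Return the (group 1, group 2) tuple, or None when nothing matches."""
    match = pattern.search(item)
    return match.groups() if match else None

print(matched_groups('num'))                     # ('num', None)
print(matched_groups('cat.FIRST(example.ord)'))  # (None, 'ord')
print(matched_groups('cat.COUNT(example)'))      # None
```

Only one group is ever filled for a match, so s[0].fillna(s[1]) merges the two columns, and the final fillna('') covers rows where neither alternative matched.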

The above gives the output you are looking for on your sample data and meets your conditions. However:

  1. What if there is UPPERCASE text after example.? The current pattern would return ''.

See example #2 below, with some of the data changed to illustrate this point:

df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[2]:
                                item                    feat
0                                num                     num
1             cat.count(example.AAA)
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
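
If uppercase text after example. should be captured as well, one possible adjustment (an assumption about the desired behaviour, not something the question confirms) is to widen the second capture group from [a-z]+ to [A-Za-z]+. A sketch with the standard re module, using a made-up helper name:

```python
import re

# Assumed variant: the second group now also accepts uppercase letters
pattern = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([A-Za-z]+)\).*$')

def feat(item):
    """Mimic s[0].fillna(s[1]).fillna('') for a single string."""
    match = pattern.search(item)
    if match is None:
        return ''
    whole, after_example = match.groups()
    return whole if whole is not None else after_example

print(feat('cat.count(example.AAA)'))  # AAA
print(feat('cat.count(example.aaa)'))  # cat.count(example.aaa)
print(feat('cat.FIRST(example.ord)'))  # ord
```

The all-lowercase rows are unaffected, because they are still matched by the first alternative.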

How to extract Uppercase word in a string in all rows of a column in a pandas dataframe?

Try this:

import pandas as pd

def cust_func(data):
    # Split the transcription on "," (the pieces are joined again later)
    words = data.split(",")

    # Indices of the pieces that are entirely uppercase and end with ":"
    column_idx = []
    for i in range(len(words)):
        if (words[i].endswith(":") or words[i].endswith(": ")) and words[i].isupper():
            column_idx.append(i)

    # For each uppercase header, join the pieces between it and the next
    # header, and save header -> text in a dict
    result = {}
    for i in range(len(column_idx)):
        if i != len(column_idx) - 1:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i] + 1:column_idx[i + 1]])
        else:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i] + 1:])
    return pd.Series(result)  # each key becomes a new column

df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df

The output has one new column per uppercase header, holding the text that followed that header.
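
The splitting logic can be tried on a single string without pandas. The sketch below is a simplified variant, not the answer's exact function: it returns a plain dict, strips surrounding whitespace from headers and text, and uses an invented sample transcription.

```python
def split_sections(text):
    """Map each uppercase 'HEADER:' piece to the text that follows it."""
    parts = text.split(",")
    # positions of the pieces that look like headers
    header_idx = [i for i, part in enumerate(parts)
                  if part.strip().endswith(":") and part.isupper()]
    sections = {}
    for n, start in enumerate(header_idx):
        # a section runs until the next header, or to the end of the text
        end = header_idx[n + 1] if n + 1 < len(header_idx) else len(parts)
        sections[parts[start].strip()] = ",".join(parts[start + 1:end]).strip()
    return sections

# hypothetical sample, shaped like "HEADER:, text, text, HEADER:, text"
sample = "SUBJECTIVE:, patient reports cough, no fever, PLAN:, rest and fluids"
sections = split_sections(sample)
print(sections)
# {'SUBJECTIVE:': 'patient reports cough, no fever', 'PLAN:': 'rest and fluids'}
```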

Extract consecutive uppercase words from a column of strings in Python

You can use the following:

words = '(?:ubicado|encuentra|direccion)'
regex = words+'[^A-Z]*([^a-z]+)'

data['extraction'] = data['strings'].str.extract(regex)

Output:

                                                   strings                         extraction
0               ubicado en QUINTA CALLE, LADO NORTE detras          QUINTA CALLE, LADO NORTE
1  encuentra por AVENIDA NORTE, ARRIBA DE IGLESIA frente a  AVENIDA NORTE, ARRIBA DE IGLESIA
2     direccion en CENTRO COMERCIAL, SEGUNDO NIVEL junto a   CENTRO COMERCIAL, SEGUNDO NIVEL

Or, to avoid trailing non-letter characters:

words = '(?:ubicado|encuentra|direccion)'
regex = words+'[^A-Z]*([^a-z]*[A-Z]+)'

data['extraction'] = data['strings'].str.extract(regex)

Output:

                                                   strings                        extraction
0               ubicado en QUINTA CALLE, LADO NORTE detras          QUINTA CALLE, LADO NORTE
1  encuentra por AVENIDA NORTE, ARRIBA DE IGLESIA frente a  AVENIDA NORTE, ARRIBA DE IGLESIA
2     direccion en CENTRO COMERCIAL, SEGUNDO NIVEL junto a   CENTRO COMERCIAL, SEGUNDO NIVEL
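
The difference between the two patterns is hard to see in the printed tables, because it is only a trailing space. A small check with the standard re module makes it visible (the variable names are chosen for this sketch):

```python
import re

words = '(?:ubicado|encuentra|direccion)'
loose = re.compile(words + '[^A-Z]*([^a-z]+)')          # first pattern
trimmed = re.compile(words + '[^A-Z]*([^a-z]*[A-Z]+)')  # second pattern

text = 'ubicado en QUINTA CALLE, LADO NORTE detras'

loose_result = loose.search(text).group(1)
trimmed_result = trimmed.search(text).group(1)

print(repr(loose_result))    # 'QUINTA CALLE, LADO NORTE '
print(repr(trimmed_result))  # 'QUINTA CALLE, LADO NORTE'
```

The second pattern forces the captured text to end on an uppercase letter, which drops the trailing space.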

