How to Extract All Upper from a String - Python

How to extract all uppercase characters from a string in Python?

Using list comprehension:

>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> ''.join([c for c in s if c.isupper()])
'ABCDEFGHIJKLMNOP'

Using generator expression:

>>> ''.join(c for c in s if c.isupper())
'ABCDEFGHIJKLMNOP'

You can also do it using regular expressions:

>>> import re
>>> re.sub('[^A-Z]', '', s)
'ABCDEFGHIJKLMNOP'
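Note that the two approaches are not equivalent for non-ASCII text: str.isupper() follows Unicode case rules, while the character class [A-Z] covers only the 26 ASCII capitals. A minimal sketch of the difference (the helper names and the sample string are made up for illustration):

```python
import re

def upper_by_isupper(text):
    # Keeps every character that Unicode classifies as uppercase.
    return ''.join(c for c in text if c.isupper())

def upper_by_ascii_regex(text):
    # Removes everything except the ASCII letters A-Z.
    return re.sub(r'[^A-Z]', '', text)

sample = 'aÄbÉcD'  # hypothetical input mixing ASCII and accented capitals
print(upper_by_isupper(sample))      # ÄÉD
print(upper_by_ascii_regex(sample))  # D
```

If the input is known to be plain ASCII, both give the same result.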

Extracting all uppercase words following each other from string

You can use itertools.groupby:

import itertools
s = "A B c de F G A"
new_s = [' '.join(b) for a, b in itertools.groupby(s.split(), key=str.isupper) if a]
print(new_s)

Output:

['A B', 'F G A']
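To see why this works, it can help to look at the raw groups before filtering. groupby starts a new group every time the key (here, whether the word is uppercase) changes, so consecutive uppercase words end up together. A small sketch using the same input (the variable name runs is only for illustration):

```python
import itertools

s = "A B c de F G A"

# Materialise each (key, group) pair so the grouping is visible.
runs = [(is_upper, list(words))
        for is_upper, words in itertools.groupby(s.split(), key=str.isupper)]
print(runs)
# [(True, ['A', 'B']), (False, ['c', 'de']), (True, ['F', 'G', 'A'])]
```

The list comprehension in the answer keeps only the groups whose key is True and joins each one with a space.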

How to extract only uppercase substring from pandas series?

How about:

df['feat'] = df['col'].str.extract(r'([A-Z_]+)', expand=False).fillna('')

Output:

                                 col           feat
0                                cat               
1                 cat.COUNT(example)          COUNT
2  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON
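One thing to watch for: str.extract keeps only the first match, and [A-Z_]+ also matches a lone underscore. The sketch below reproduces that behaviour on single strings with re.search; the helper first_upper_run, the input 'my_cat.COUNT(example)', and the stricter pattern are assumptions added for illustration, not part of the original answer.

```python
import re

def first_upper_run(text, pattern=r'([A-Z_]+)'):
    # Like str.extract: return the first match of the capture group, or ''.
    m = re.search(pattern, text)
    return m.group(1) if m else ''

print(first_upper_run('cat.COUNT(example)'))     # COUNT
print(first_upper_run('cat'))                    # (empty string)
print(first_upper_run('my_cat.COUNT(example)'))  # _

# A stricter pattern that must start with a capital letter.
print(first_upper_run('my_cat.COUNT(example)', r'([A-Z][A-Z_]*)'))  # COUNT
```

If lowercase names with underscores can appear in the column, the stricter pattern may be the safer choice.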

How to extract uppercase and title case string sections into separate columns

Use pandas.Series.str.extractall, which returns each capture group in the regex pattern as a separate column.
The first capture group can pick up trailing whitespace, which must be removed with .str.strip().
- Without strip: df.iloc[0, 2] → 'DREGGHE '

import pandas as pd

# sample dataframe
data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)

# extract names
df[['First Name','Last Name']] = df['Naam aanvrager'].str.extractall(r'(\b[A-Z ]+\b)(\w+)*').reset_index()[[1,0]]

# the pattern to extract the Last Name may include extra whitespace, which can be removed as follows
df['Last Name'] = df['Last Name'].str.strip()

# display(df)
       Naam aanvrager First Name      Last Name
0     DREGGHE Joannes    Joannes        DREGGHE
1          MAHIEU Leo        Leo         MAHIEU
2       NIEUWENHUIJSE        NaN  NIEUWENHUIJSE
3             COPPENS        NaN        COPPENS
4  VERBURGHT Cornelis   Cornelis      VERBURGHT
5    NUYTTENS Adriaen    Adriaen       NUYTTENS
6  DE LARUELLE Pieter     Pieter    DE LARUELLE
7          VAN VIJVER        NaN     VAN VIJVER
8      SILBO Martinus   Martinus          SILBO
9   STEEMAERE Anthone    Anthone      STEEMAERE
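To check what the two capture groups return for a single name, the same pattern can be tried with the re module outside pandas. This is only a sketch; the helper split_name is hypothetical and returns None for a missing first name, where pandas shows NaN.

```python
import re

NAME_PATTERN = re.compile(r'(\b[A-Z ]+\b)(\w+)*')

def split_name(full_name):
    # Group 1: the run of uppercase words (may end with a space).
    # Group 2: the title-case word that follows, if there is one.
    m = NAME_PATTERN.search(full_name)
    if m is None:
        return (None, None)
    return (m.group(1).strip(), m.group(2))

print(split_name('DREGGHE Joannes'))     # ('DREGGHE', 'Joannes')
print(split_name('DE LARUELLE Pieter'))  # ('DE LARUELLE', 'Pieter')
print(split_name('VAN VIJVER'))          # ('VAN VIJVER', None)
```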

How to extract the uppercase as well as some substring from pandas dataframe using extract?

For feat, since you already have an answer for agg from your other Stack Overflow question, you can extract two different series from two patterns separated by | and then fillna() one series with the other.

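^([^A-Z]*$) returns the full string, but only if the string contains no uppercase letters.
[^a-z].*example\.([a-z]+)\).*$ returns the lowercase text after example. and before ), and only if a non-lowercase character appears somewhere before example. (Because the pattern uses [^a-z], a dot or parenthesis also counts, not just an uppercase letter.)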

df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[1]: 
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)      
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num

The above gives the output you are looking for on your sample data and satisfies your conditions. However:

What if there is uppercase text after example.? The current pattern would return ''.

See example #2 below, with some of the data changed to illustrate this point:

df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[2]: 
                                item                    feat
0                                num                     num
1             cat.count(example.AAA)                        
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
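The same two-branch logic can be checked on single strings with the re module, which makes it easier to test edge cases before applying it to a whole column. A minimal sketch (the helper name feat_of is made up; it mirrors the extract plus fillna steps above):

```python
import re

PATTERN = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$')

def feat_of(item):
    # Group 1 is set when the whole string has no uppercase letters;
    # group 2 is set when lowercase text follows "example." before ")".
    m = PATTERN.search(item)
    if m is None:
        return ''
    return m.group(1) or m.group(2) or ''

for item in ['num', 'cat.COUNT(example)', 'cat.FIRST(example.ord)',
             'cat.count(example.AAA)', 'cat.count(example.aaa)']:
    print(repr(item), '->', repr(feat_of(item)))
```

The results for these inputs agree with the feat column in the two outputs above.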

How to extract Uppercase word in a string in all rows of a column in a pandas dataframe?

Try this:

def cust_func(data):
    # Split the transcription on commas; the pieces are joined again later.
    words = data.split(",")

    # Collect the indices of pieces that are entirely uppercase and end with ":".
    column_idx = []
    for i, word in enumerate(words):
        if word.endswith((":", ": ")) and word.isupper():
            column_idx.append(i)

    # Build the text for each uppercase heading by joining the pieces
    # between it and the next heading, and store both in a dict.
    result = {}
    for i in range(len(column_idx)):
        if i != len(column_idx) - 1:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i]+1:column_idx[i+1]])
        else:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i]+1:])
    return pd.Series(result)  # each dict key becomes a new column

df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df

Output: each uppercase heading found in transcription becomes a new column of df. (The original answer showed a screenshot here, which could not capture all the columns.)
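To see what the function produces for one row without pandas, here is a sketch of the same splitting logic that returns a plain dict. The name split_sections and the sample transcription are assumptions; the real data from the question is not shown here.

```python
def split_sections(transcription):
    """Return {heading: text} for uppercase headings that end with ':'."""
    words = transcription.split(",")
    heading_idx = [i for i, w in enumerate(words)
                   if w.endswith((":", ": ")) and w.isupper()]
    sections = {}
    for pos, start in enumerate(heading_idx):
        # The text runs up to the next heading, or to the end of the string.
        end = heading_idx[pos + 1] if pos + 1 < len(heading_idx) else len(words)
        sections[words[start]] = ",".join(words[start + 1:end])
    return sections

# Hypothetical input: headings are separated from their text by commas.
sample = "SUBJECTIVE:, patient reports pain, worse at night,PLAN:, rest, ice"
print(split_sections(sample))
# {'SUBJECTIVE:': ' patient reports pain, worse at night', 'PLAN:': ' rest, ice'}
```

Note that a heading is only detected when it is its own comma-separated piece, and the leading spaces of the text are kept.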

Extract consecutive uppercase words from a column of strings in Python

You can use the following:

words = '(?:ubicado|encuentra|direccion)'
regex = words+'[^A-Z]*([^a-z]+)'

data['extraction'] = data['strings'].str.extract(regex)

Output:

                                                   strings                         extraction
0               ubicado en QUINTA CALLE, LADO NORTE detras          QUINTA CALLE, LADO NORTE 
1  encuentra por AVENIDA NORTE, ARRIBA DE IGLESIA frente a  AVENIDA NORTE, ARRIBA DE IGLESIA 
2     direccion en CENTRO COMERCIAL, SEGUNDO NIVEL junto a   CENTRO COMERCIAL, SEGUNDO NIVEL

Or, to avoid trailing non-letter characters:

words = '(?:ubicado|encuentra|direccion)'
regex = words+'[^A-Z]*([^a-z]*[A-Z]+)'

data['extraction'] = data['strings'].str.extract(regex)

Output:

                                                   strings                        extraction
0               ubicado en QUINTA CALLE, LADO NORTE detras          QUINTA CALLE, LADO NORTE
1  encuentra por AVENIDA NORTE, ARRIBA DE IGLESIA frente a  AVENIDA NORTE, ARRIBA DE IGLESIA
2     direccion en CENTRO COMERCIAL, SEGUNDO NIVEL junto a   CENTRO COMERCIAL, SEGUNDO NIVEL
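The difference between the two patterns is easiest to see on a single string with the re module. A short sketch using the first row of the sample data (the names loose and tight are only labels for this example):

```python
import re

words = '(?:ubicado|encuentra|direccion)'
loose = re.compile(words + '[^A-Z]*([^a-z]+)')         # may end on a space or comma
tight = re.compile(words + '[^A-Z]*([^a-z]*[A-Z]+)')   # must end on a capital letter

text = 'ubicado en QUINTA CALLE, LADO NORTE detras'
print(repr(loose.search(text).group(1)))  # 'QUINTA CALLE, LADO NORTE '
print(repr(tight.search(text).group(1)))  # 'QUINTA CALLE, LADO NORTE'
```

The second pattern forces the capture to finish with [A-Z]+, so the regex engine backtracks past the trailing space.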
