How to extract all UPPER from a string? Python
Using list comprehension:
>>> s = 'abcdefgABCDEFGHIJKLMNOP'
>>> ''.join([c for c in s if c.isupper()])
'ABCDEFGHIJKLMNOP'
Using generator expression:
>>> ''.join(c for c in s if c.isupper())
'ABCDEFGHIJKLMNOP'
You can also do it using regular expressions:
>>> import re
>>> re.sub(r'[^A-Z]', '', s)
'ABCDEFGHIJKLMNOP'
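One difference between these approaches is worth keeping in mind: `str.isupper()` is Unicode-aware, while the character class `[^A-Z]` keeps only ASCII capitals. A minimal sketch, using a made-up sample string that is not from the original question:

```python
import re

s = 'abcÉdefÑGHI'  # hypothetical sample containing accented capitals

# str.isupper() is Unicode-aware, so É and Ñ are kept
via_isupper = ''.join(c for c in s if c.isupper())

# [^A-Z] removes every character outside ASCII A-Z
via_regex = re.sub(r'[^A-Z]', '', s)

print(via_isupper)  # ÉÑGHI
print(via_regex)    # GHI
```

Which one is appropriate depends on whether non-ASCII capitals should count as uppercase in your data.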
Extracting all uppercase words following each other from string
You can use itertools.groupby:
import itertools
s = "A B c de F G A"
new_s = [' '.join(b) for a, b in itertools.groupby(s.split(), key=str.isupper) if a]
Output:
['A B', 'F G A']
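To see why this works, it may help to look at what `groupby` yields before the filter is applied. It splits the word list into consecutive runs that share the same `str.isupper` result, and the comprehension keeps only the runs whose key is `True`. A small sketch (the names `runs` and `upper_runs` are only for illustration):

```python
import itertools

s = "A B c de F G A"

# each item is (key, words in that run); consecutive words with the same key are grouped
runs = [(is_upper, list(words))
        for is_upper, words in itertools.groupby(s.split(), key=str.isupper)]
print(runs)
# [(True, ['A', 'B']), (False, ['c', 'de']), (True, ['F', 'G', 'A'])]

# keep only the uppercase runs and join each one back into a string
upper_runs = [' '.join(words) for is_upper, words in runs if is_upper]
print(upper_runs)  # ['A B', 'F G A']
```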
How to extract only uppercase substring from pandas series?
How about:
df['feat'] = df['col'].str.extract(r'([A-Z_]+)', expand=False).fillna('')
Output:
   col                                feat
0  cat
1  cat.COUNT(example)                 COUNT
2  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON
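For a single value, the pattern behaves roughly like taking the first `re.search` match, which is also why only the first uppercase run in each row is returned. A standard-library sketch of that per-row logic (`first_upper_run` is a hypothetical helper, not a pandas function):

```python
import re

def first_upper_run(text):
    # hypothetical helper: first run of A-Z or underscore, or '' when there is none
    m = re.search(r'[A-Z_]+', text)
    return m.group(0) if m else ''

samples = ['cat', 'cat.COUNT(example)', 'cat.N_MOST_COMMON(example.ord)[2]']
extracted = [first_upper_run(x) for x in samples]
print(extracted)
# ['', 'COUNT', 'N_MOST_COMMON']

# caveat: '_' is part of the character class, so a lowercase name with an underscore matches too
print(repr(first_upper_run('my_col')))  # '_'
```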
How to extract uppercase and title case string sections into separate columns
- Use pandas.Series.str.extractall, which will extract multiple capture groups in the regex pattern as new columns.
- The pattern may also extract extra whitespace, which must be removed with .str.strip()
  - Without strip: df.iloc[0, 2] → 'DREGGHE '
import pandas as pd
# sample dataframe
data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)
# extract names
df[['First Name','Last Name']] = df['Naam aanvrager'].str.extractall(r'(\b[A-Z ]+\b)(\w+)*').reset_index()[[1,0]]
# the pattern to extract the Last Name may include extra whitespace, which can be removed as follows
df['Last Name'] = df['Last Name'].str.strip()
# display(df)
   Naam aanvrager      First Name  Last Name
0  DREGGHE Joannes     Joannes     DREGGHE
1  MAHIEU Leo          Leo         MAHIEU
2  NIEUWENHUIJSE       NaN         NIEUWENHUIJSE
3  COPPENS             NaN         COPPENS
4  VERBURGHT Cornelis  Cornelis    VERBURGHT
5  NUYTTENS Adriaen    Adriaen     NUYTTENS
6  DE LARUELLE Pieter  Pieter      DE LARUELLE
7  VAN VIJVER          NaN         VAN VIJVER
8  SILBO Martinus      Martinus    SILBO
9  STEEMAERE Anthone   Anthone     STEEMAERE
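The behaviour of the two capture groups, including the trailing space that `.str.strip()` removes, can be checked on individual names with the standard `re` module. This is only a sketch of what the pattern captures, not a replacement for the pandas code:

```python
import re

pattern = re.compile(r'(\b[A-Z ]+\b)(\w+)*')

results = []
for name in ['DREGGHE Joannes', 'DE LARUELLE Pieter', 'VAN VIJVER']:
    m = pattern.search(name)
    # group 1 is the uppercase part (possibly with a trailing space), group 2 the title-case part
    results.append((m.group(1), m.group(2)))
    print(repr(m.group(1)), repr(m.group(2)))
# 'DREGGHE ' 'Joannes'
# 'DE LARUELLE ' 'Pieter'
# 'VAN VIJVER' None
```

The `None` in the last row corresponds to the `NaN` values in the First Name column.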
How to extract the uppercase as well as some substring from pandas dataframe using extract?
For feat, since you already got the answer to agg in your other StackOverflow question, I think you can use the following to extract two different series based on two different patterns separated with |, and then fillna() one series with the other.
- ^([^A-Z]*$) should only return the full string, and only if the full string contains no uppercase letters
- [^a-z].*example\.([a-z]+)\).*$ should only return the lowercase text after example. and before ), and only if a non-lowercase character (for instance an uppercase letter) appears in the string prior to example.
df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[1]:
   item                               feat
0  num                                num
1  bool                               bool
2  cat                                cat
3  cat.COUNT(example)
4  cat.N_MOST_COMMON(example.ord)[2]  ord
5  cat.FIRST(example.ord)             ord
6  cat.FIRST(example.num)             num
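To check which branch of the alternation fires for a given value, the same pattern can be run through the standard `re` module. This is a sketch for inspection only; the two tuple positions correspond to columns 0 and 1 of the extracted frame:

```python
import re

pattern = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$')

items = ['num', 'cat.COUNT(example)', 'cat.FIRST(example.ord)']
results = []
for item in items:
    m = pattern.search(item)
    # no match at all is what ends up as '' after the final fillna('')
    results.append(m.groups() if m else None)
    print(item, results[-1])
# num ('num', None)
# cat.COUNT(example) None
# cat.FIRST(example.ord) (None, 'ord')
```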
The above gives the output you are looking for on your sample data and holds to your conditions. However:
- What if there are uppercase letters after example.? The current output would return ''.
See example #2 below, with some of the data changed according to the point above:
df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[2]:
   item                               feat
0  num                                num
1  cat.count(example.AAA)
2  cat.count(example.aaa)             cat.count(example.aaa)
3  cat.count(example)                 cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]  ord
5  cat.FIRST(example.ord)             ord
6  cat.FIRST(example.num)             num
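If uppercase text after example. should be captured as well, one possible adjustment is to widen the second group to `[A-Za-z]+`. This is an assumption about the desired behaviour, not something stated in the question, and `feat` below is a hypothetical helper that mimics the `fillna` chain for one value:

```python
import re

# assumption: letters of either case after "example." should be captured
pattern = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([A-Za-z]+)\).*$')

def feat(item):
    # hypothetical helper mirroring s[0].fillna(s[1]).fillna('') for a single value
    m = pattern.search(item)
    if m is None:
        return ''
    return m.group(1) if m.group(1) is not None else m.group(2)

for item in ['num', 'cat.count(example.AAA)', 'cat.FIRST(example.ord)']:
    print(item, '->', repr(feat(item)))
# num -> 'num'
# cat.count(example.AAA) -> 'AAA'
# cat.FIRST(example.ord) -> 'ord'
```

Fully lowercase values such as cat.count(example.aaa) still match the first branch and are returned whole, so that case would need separate handling.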
How to extract Uppercase word in a string in all rows of a column in a pandas dataframe?
Try using this:
import pandas as pd

def cust_func(data):
    ## split the transcription on the , delimiter - later we will join
    words = data.split(",")
    ## get the index of words which are completely in uppercase and also end with :
    column_idx = []
    for i in range(len(words)):
        if ((words[i].endswith(":") or words[i].endswith(": ")) and words[i].isupper()):
            column_idx.append(i)
    ## Find the sentence for each of the capital words by joining the words
    ## between two consecutive capital words
    ## Save the cap word and the respective sentence in a dict.
    result = {}
    for i in range(len(column_idx)):
        if i != len(column_idx)-1:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i]+1:column_idx[i+1]])
        else:
            result[words[column_idx[i]]] = ",".join(words[column_idx[i]+1:])
    return pd.Series(result)  ## this creates new columns

df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df
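The splitting logic can be tried on a single string without pandas. The sketch below is a hypothetical standard-library variant (`split_sections` and the sample text are made up for illustration). Unlike `cust_func`, it strips surrounding whitespace from labels and values:

```python
def split_sections(text):
    # hypothetical stdlib-only variant of cust_func: returns a plain dict
    words = text.split(",")
    # positions of the all-uppercase labels that end with ':'
    label_idx = [i for i, w in enumerate(words)
                 if w.strip().endswith(":") and w.isupper()]
    result = {}
    # pair each label with the words up to the next label (or the end of the text)
    for start, end in zip(label_idx, label_idx[1:] + [len(words)]):
        result[words[start].strip()] = ",".join(words[start + 1:end]).strip()
    return result

sample = "SUBJECTIVE:, patient reports pain, no fever, PLAN:, rest, fluids"  # made-up input
sections = split_sections(sample)
print(sections)
# {'SUBJECTIVE:': 'patient reports pain, no fever', 'PLAN:': 'rest, fluids'}
```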
Extract consecutive uppercase words from a column of strings in Python
You can use the following:
words = '(?:ubicado|encuentra|direccion)'
regex = words+'[^A-Z]*([^a-z]+)'
data['extraction'] = data['strings'].str.extract(regex)
Output:
   strings                                                  extraction
0  ubicado en QUINTA CALLE, LADO NORTE detras               QUINTA CALLE, LADO NORTE
1  encuentra por AVENIDA NORTE, ARRIBA DE IGLESIA frente a  AVENIDA NORTE, ARRIBA DE IGLESIA
2  direccion en CENTRO COMERCIAL, SEGUNDO NIVEL junto a     CENTRO COMERCIAL, SEGUNDO NIVEL
Or, to avoid trailing non-letter characters:
words = '(?:ubicado|encuentra|direccion)'
regex = words+'[^A-Z]*([^a-z]*[A-Z]+)'
data['extraction'] = data['strings'].str.extract(regex)
Output:
   strings                                                  extraction
0  ubicado en QUINTA CALLE, LADO NORTE detras               QUINTA CALLE, LADO NORTE
1  encuentra por AVENIDA NORTE, ARRIBA DE IGLESIA frente a  AVENIDA NORTE, ARRIBA DE IGLESIA
2  direccion en CENTRO COMERCIAL, SEGUNDO NIVEL junto a     CENTRO COMERCIAL, SEGUNDO NIVEL
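The two outputs look identical when printed as a table, so the difference is easier to see with `repr` on a single string. A small sketch using the standard `re` module (the names `loose` and `tight` are only for illustration):

```python
import re

words = '(?:ubicado|encuentra|direccion)'
loose = re.compile(words + '[^A-Z]*([^a-z]+)')         # first version
tight = re.compile(words + '[^A-Z]*([^a-z]*[A-Z]+)')   # second version, must end on a capital

text = 'ubicado en QUINTA CALLE, LADO NORTE detras'
loose_match = loose.search(text).group(1)
tight_match = tight.search(text).group(1)

print(repr(loose_match))  # 'QUINTA CALLE, LADO NORTE '
print(repr(tight_match))  # 'QUINTA CALLE, LADO NORTE'
```

The first pattern keeps the space before the next lowercase word, while the second backtracks until the capture ends on an uppercase letter.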