How to Extract Data from Text Field in Pandas Dataframe

How to extract data from text field in pandas dataframe?

The problem is that the values in the tags column are strings, not dictionaries.

So a first step is needed to parse them:

import ast

df['tags'] = df['tags'].apply(ast.literal_eval)

After that, the original answer applies, and it works well even when the dictionaries have multiple fields.

Verifying:

from collections import Counter

import pandas as pd

df = pd.DataFrame([
[43,{"tags":[],"image":["https://image.com/Kqk.jpg"]}],
[83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
[76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
[77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
[81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])
#print (df)

# convert the column to strings to reproduce the problem
df['tags'] = df['tags'].astype(str)

print(df['tags'].apply(type))
0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>
4 <class 'str'>
Name: tags, dtype: object

# convert the strings back to dictionaries
df['tags'] = df['tags'].apply(ast.literal_eval)

print(df['tags'].apply(type))
0 <class 'dict'>
1 <class 'dict'>
2 <class 'dict'>
3 <class 'dict'>
4 <class 'dict'>
Name: tags, dtype: object

c = Counter([len(x['tags']) for x in df['tags']])

df = pd.DataFrame({'Number of posts': list(c.values()), 'Number of tags': list(c.keys())})
print(df)
   Number of posts  Number of tags
0                1               0
1                1               3
2                1               2
3                1               5
4                1               1
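Once the column holds real dictionaries, individual fields can be pulled out into their own columns. The following is only a sketch: the two sample rows are shortened copies of the data above, and the helper parse_if_str and the new column names (tag_list, n_tags, n_images) are hypothetical names chosen for illustration.

```python
import ast

import pandas as pd

# Shortened sample: the dictionaries arrive as strings, as in the question.
df = pd.DataFrame({
    "_id": [43, 83],
    "tags": [
        '{"tags": [], "image": ["https://image.com/Kqk.jpg"]}',
        '{"tags": ["yourself", "start", ""], "image": ["https://images.com/test.jpg"]}',
    ],
})


def parse_if_str(value):
    """Hypothetical helper: parse strings, pass other objects through unchanged."""
    return ast.literal_eval(value) if isinstance(value, str) else value


parsed = df["tags"].apply(parse_if_str)

# .get() supplies an empty list when a key is missing from a row's dictionary.
df["tag_list"] = parsed.apply(lambda d: d.get("tags", []))
df["n_tags"] = df["tag_list"].apply(len)
df["n_images"] = parsed.apply(lambda d: len(d.get("image", [])))
print(df[["_id", "n_tags", "n_images"]])
```

Guarding with isinstance means the same step can be rerun on a column that has already been parsed without raising an error.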

How to extract specific text from a pandas column

Split on the opening square bracket and take the first element of the resulting list.

df['store'] = df['store'].str.split(r'\[').str[0]
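To see the effect on concrete values, here is a small sketch. The store names are made up for illustration, and the extra str.strip() call is an assumption that trailing whitespace before the bracket is unwanted. A single-character pattern such as "[" is treated by str.split as a literal string, so no escaping is needed in this variant.

```python
import pandas as pd

# Hypothetical store names; the bracketed suffix is what should be dropped.
df = pd.DataFrame({"store": ["Walmart[edit]", "Target [note 1]", "Costco"]})

# n=1 splits only at the first bracket; rows without a bracket are left whole.
df["store"] = df["store"].str.split("[", n=1).str[0].str.strip()
print(df["store"].tolist())
```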

How to extract text from a column in pandas

Use str.split:

df = pd.DataFrame({'Col1': ['1_A01_1_1_NA', '11_B40_11_8_NA']})
out = df['Col1'].str.split('_', expand=True)

Output:

>>> out
    0    1   2  3   4
0   1  A01   1  1  NA
1  11  B40  11  8  NA
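The split pieces come back as strings under integer column labels. A possible follow-up, sketched here, is to name the columns and convert the numeric ones. The names run, well, row, col and flag are hypothetical, since the question does not say what each field means.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["1_A01_1_1_NA", "11_B40_11_8_NA"]})

parts = df["Col1"].str.split("_", expand=True)
# Hypothetical column names, chosen only for illustration.
parts.columns = ["run", "well", "row", "col", "flag"]
# Every piece is still a string; convert the numeric fields explicitly.
parts = parts.astype({"run": int, "row": int, "col": int})

out = df.join(parts)
print(out)
```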

Extract a column from text file and store it in dataframe in Python

One way is to copy each .txt file to a .csv file first, then read it with pandas into a DataFrame:

import glob
import pandas as pd

# Copy every .txt file to a .csv file with the same base name
for each in glob.glob('*.txt'):
    with open(each, 'r') as file:
        content = file.readlines()
    with open('{}.csv'.format(each[0:-4]), 'w') as file:
        file.writelines(content)

# Read each .csv file; dataframe ends up holding only the last file read
for each in glob.glob('*.csv'):
    dataframe = pd.read_csv(each, skiprows=0, header=None, index_col=0)

then:

dataframe.reset_index(inplace=True)

output:

>>> print(dataframe[0])
0     O.U20
1     O.Z20
2     O.H21
3     O.M21
4    S3.U20
5    S3.Z20
Name: 0, dtype: object
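As an alternative sketch, pd.read_csv can read a comma-separated text file directly, whatever its extension, and usecols limits the result to the column of interest. The file contents below are hypothetical: only the first column mirrors the output above, and the numeric columns are invented for illustration. An in-memory buffer stands in for the file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical file contents; only the first column comes from the output above.
text = """O.U20,99.5,10
O.Z20,99.4,12
O.H21,99.3,7
"""

# read_csv accepts a path to any text file or a file-like object such as StringIO.
first_col = pd.read_csv(io.StringIO(text), header=None, usecols=[0])[0]
print(first_col.tolist())
```

With a real file, the io.StringIO(text) argument would be replaced by the path to the .txt file.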

How to extract text from specific rows based on column in pandas?

Change

policy = data.loc[data['table_col_name']=='agent']

to

policy = data.loc[data['table_col_name']=='agent', ['node_name', 'src_name']]

The pandas documentation explains how you can index your dataframes:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-label
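A small sketch of that .loc call on made-up data: the first argument is a boolean row mask and the second is the list of columns to keep. Only the column names come from the answer above; the values and the extra column other are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
data = pd.DataFrame({
    "table_col_name": ["agent", "policy", "agent"],
    "node_name": ["n1", "n2", "n3"],
    "src_name": ["a.xml", "b.xml", "c.xml"],
    "other": [1, 2, 3],
})

# Boolean mask selecting the rows, then a list of labels selecting the columns.
mask = data["table_col_name"] == "agent"
policy = data.loc[mask, ["node_name", "src_name"]]
print(policy)
```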

Extracting text from elements in Pandas column, writing to new column

At first it may not be clear why the second pair of parentheses doesn't match; the likely cause is the ! character.

You can use str.extract with a regular expression.

The regular expression \(([A-Za-z0-9 _]+)\) means:

  1. \( matches a literal ( character
  2. ( begins a new group
  3. [A-Za-z0-9 _] is a character set matching any letter (upper or lower case), digit, underscore, or space
  4. + matches the preceding element (the character set) one or more times.
  5. ) ends the group
  6. \) matches a literal ) character

The second pair of parentheses isn't matched because the regex excludes the ! character: it isn't in the character set [A-Za-z0-9 _].
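The effect of the character set can be checked with the standard re module on one of the sample strings. This is only a sketch: the relaxed pattern, which adds ! to the set, is an assumption for illustration and not part of the original answer.

```python
import re

text = "(info) text (yay!)"

strict = r"\(([A-Za-z0-9 _]+)\)"
# Assumption for illustration: adding "!" to the set lets "(yay!)" match too.
relaxed = r"\(([A-Za-z0-9 _!]+)\)"

strict_hits = re.findall(strict, text)
relaxed_hits = re.findall(relaxed, text)
print(strict_hits)   # ['info']
print(relaxed_hits)  # ['info', 'yay!']
```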

import pandas as pd
import io

temp = u"""(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)"""

df = pd.read_csv(io.StringIO(temp), header=None, names=['original'])
print(df)
#                 original
# 0     (info) text (yay!)
# 1            I love text
# 2            Text is fun
# 3  (more info) more text
# 4      lotsa text (boo!)

# expand=False returns a Series for a single capture group
df['col1'] = df['original'].str.extract(r"\(([A-Za-z0-9 _]+)\)", expand=False)
# regex=True is required in recent pandas versions, where str.replace is literal by default
df['col2'] = df['original'].str.replace(r"\(([A-Za-z0-9 _]+)\)", "", regex=True)
print(df)
#                 original       col1               col2
# 0     (info) text (yay!)       info        text (yay!)
# 1            I love text        NaN        I love text
# 2            Text is fun        NaN        Text is fun
# 3  (more info) more text  more info          more text
# 4      lotsa text (boo!)        NaN  lotsa text (boo!)

