How to extract data from text field in pandas dataframe?
The problem is that the values in the tags column are strings, not dictionaries.
So a conversion step is needed first:
import ast
df['tags'] = df['tags'].apply(ast.literal_eval)
and then apply the original answer, which works well even when the dictionaries contain multiple fields.
Verifying:
from collections import Counter
import pandas as pd
df=pd.DataFrame([
[43,{"tags":[],"image":["https://image.com/Kqk.jpg"]}],
[83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
[76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
[77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
[81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])
#print (df)
#convert column to strings to verify the solution
df['tags'] = df['tags'].astype(str)
print (df['tags'].apply(type))
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
Name: tags, dtype: object
#convert back
df['tags'] = df['tags'].apply(ast.literal_eval)
print (df['tags'].apply(type))
0    <class 'dict'>
1    <class 'dict'>
2    <class 'dict'>
3    <class 'dict'>
4    <class 'dict'>
Name: tags, dtype: object
#count how many posts have each number of tags
c = Counter([len(x['tags']) for x in df['tags']])
df = pd.DataFrame({'Number of posts': list(c.values()), 'Number of tags': list(c.keys())})
print (df)
   Number of posts  Number of tags
0                1               0
1                1               3
2                1               2
3                1               5
4                1               1
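If the column might hold a mix of real dictionaries and their string representations, a guarded conversion can avoid errors from calling ast.literal_eval on a value that is already a dict. The following is only a sketch: the helper name to_dict and the two-row sample frame are made up for illustration.

```python
import ast

import pandas as pd


def to_dict(value):
    """Parse the string form of a dict; pass real dicts through unchanged."""
    if isinstance(value, dict):
        return value
    return ast.literal_eval(value)


# Hypothetical sample: one row is already a dict, the other is stored as text.
sample = pd.DataFrame({
    "_id": [43, 83],
    "tags": [
        {"tags": [], "image": ["https://image.com/Kqk.jpg"]},
        '{"tags": ["yourself", "start", ""]}',
    ],
})

sample["tags"] = sample["tags"].apply(to_dict)

# Number of tags per post, read from the inner "tags" list.
tag_counts = sample["tags"].apply(lambda d: len(d["tags"])).tolist()
print(tag_counts)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text that comes from a file or database.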
How to extract specific text from a pandas column
Split by the opening square bracket and pick first index value in the resulting list.
df['store'] = df['store'].str.split(r'\[').str[0]
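A small self-contained sketch of that split, assuming store names followed by a bracketed suffix. The sample values are invented, and the trailing .str.strip() is an addition to drop the space left in front of the bracket.

```python
import pandas as pd

# Hypothetical store names; the last one has no bracket at all.
stores = pd.DataFrame({"store": ["Walmart [Canada]", "Target[US]", "Costco"]})

# Split on the literal "[" and keep the part before it,
# then strip any whitespace left at the end.
stores["store"] = stores["store"].str.split(r"\[").str[0].str.strip()

print(stores["store"].tolist())
```

Rows without a bracket come back unchanged, because the split then returns a one-element list.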
How to extract text from a column in pandas
Use str.split:
df = pd.DataFrame({'Col1': ['1_A01_1_1_NA', '11_B40_11_8_NA']})
out = df['Col1'].str.split('_', expand=True)
Output:
>>> out
    0    1   2  3   4
0   1  A01   1  1  NA
1  11  B40  11  8  NA
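As a possible follow-up, the split columns can be given names, or a single piece can be taken without expanding. This is a sketch using the same two sample strings; the column names are hypothetical, since the original does not say what each field means.

```python
import pandas as pd

df_parts = pd.DataFrame({"Col1": ["1_A01_1_1_NA", "11_B40_11_8_NA"]})

# expand=True returns one column per piece.
parts = df_parts["Col1"].str.split("_", expand=True)
parts.columns = ["run", "well", "plate", "rep", "flag"]  # hypothetical names

# Without expand, each row holds a list, so .str[1] picks the second piece.
second = df_parts["Col1"].str.split("_").str[1]

print(parts)
print(second.tolist())
```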
Extract a column from text file and store it in dataframe in Python
First convert the .txt files to .csv; after that you can read them with pandas and turn them into a DataFrame:
import glob
import pandas as pd
# copy each .txt file to a .csv file with the same base name
for each in glob.glob('*.txt'):
    with open(each , 'r') as file:
        content = file.readlines()
    with open('{}.csv'.format(each[0:-4]) , 'w') as file:
        file.writelines(content)
# read each .csv file (dataframe keeps the last file read)
for each in glob.glob('*.csv'):
    dataframe = pd.read_csv(each , skiprows=0 , header=None , index_col= 0)
then:
dataframe.reset_index(inplace=True)
output:
>>>print(dataframe[0])
0 O.U20
1 O.Z20
2 O.H21
3 O.M21
4 S3.U20
5 S3.Z20
Name: 0, dtype: object
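The conversion step may not be needed, since pd.read_csv reads from any text source regardless of file extension. The sketch below assumes a whitespace-delimited file and uses io.StringIO in place of a real file; the second column of numbers is invented for illustration.

```python
import io

import pandas as pd

# Stand-in for the contents of a .txt file (second column is hypothetical).
raw = """O.U20 1.5
O.Z20 2.0
O.H21 2.5
"""

# sep=r"\s+" splits on runs of whitespace; header=None gives integer column labels.
frame = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

first_col = frame[0]
print(first_col.tolist())
```

With a real file, the path would be passed in place of the StringIO object.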
How to extract text from specific rows based on column in pandas?
Change
policy = data.loc[data['table_col_name']=='agent']
to
policy = data.loc[data['table_col_name']=='agent', ['node_name', 'src_name']]
The pandas documentation explains how you can index your dataframes:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-label
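A minimal sketch of that selection, assuming a frame with the three columns named above. The row values and the extra column other are made up.

```python
import pandas as pd

# Hypothetical data using the column names from the answer.
data = pd.DataFrame({
    "table_col_name": ["agent", "policy", "agent"],
    "node_name": ["n1", "n2", "n3"],
    "src_name": ["s1", "s2", "s3"],
    "other": [1, 2, 3],
})

# The first argument to .loc is a boolean row mask, the second a list of column labels.
mask = data["table_col_name"] == "agent"
policy = data.loc[mask, ["node_name", "src_name"]]

print(policy)
```

The result keeps the original index labels of the matching rows.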
Extracting text from elements in Pandas column, writing to new column
It isn't clear why the second pair of parentheses shouldn't match; probably because of the ! character.
You can use str.extract with a regular expression.
The regular expression \(([A-Za-z0-9 _]+)\) means:
\( matches a literal ( character
( begins a new group
[A-Za-z0-9 _] is a character set matching any letter (capital or lower case), digit, underscore, or space
+ matches the preceding element (the character set) one or more times
) ends the group
\) matches a literal ) character
The second pair of parentheses isn't matched because the ! character is not in the character set [A-Za-z0-9 _].
import pandas as pd
import numpy as np
import io
temp=u"""(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)"""
df = pd.read_csv(io.StringIO(temp), header=None, names=['original'])
print(df)
#                 original
#0      (info) text (yay!)
#1             I love text
#2             Text is fun
#3   (more info) more text
#4       lotsa text (boo!)
# expand=False returns a Series for a single capture group
df['col1'] = df['original'].str.extract(r"\(([A-Za-z0-9 _]+)\)", expand=False)
# regex=True is required for pattern replacement in recent pandas versions
df['col2'] = df['original'].str.replace(r"\(([A-Za-z0-9 _]+)\)", "", regex=True)
print(df)
#                 original       col1               col2
#0      (info) text (yay!)       info        text (yay!)
#1             I love text        NaN        I love text
#2             Text is fun        NaN        Text is fun
#3   (more info) more text  more info          more text
#4       lotsa text (boo!)        NaN  lotsa text (boo!)
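If the parenthesized text containing ! should be captured as well, one option is a character set that accepts anything except a closing parenthesis. This is a sketch of that variation, not the pattern from the answer above; the frame name texts and the pattern [^)]+ are choices made here.

```python
import pandas as pd

texts = pd.DataFrame({"original": [
    "(info) text (yay!)",
    "I love text",
    "(more info) more text",
    "lotsa text (boo!)",
]})

# [^)]+ matches one or more characters that are not ")".
pattern = r"\(([^)]+)\)"

# str.extract returns only the first match in each row.
texts["first"] = texts["original"].str.extract(pattern, expand=False)

# str.findall returns every match in each row as a list.
all_matches = texts["original"].str.findall(pattern)

print(texts)
print(all_matches.tolist())
```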