Pandas populate new dataframe column based on matching columns in another dataframe
Consider the following dataframes df
and df2
df = pd.DataFrame(dict(
AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
title= list('zyxwvutsrqponml')
))
df2 = pd.DataFrame(dict(
AUTHOR_NAME=list('AABCCEGG'),
title =list('zwvtrpml'),
CATEGORY =list('11223344')
))
option 1merge
df.merge(df2, how='left')
option 2join
cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)
both options yield
Pandas: Add a new column in a data frame based on a value in another data frame
print (df1)
userId gender
0 1 F
1 2 M
2 3 F
3 4 M
4 5 M
5 6 M
print (df2)
userId itemClicked ItemBought date
0 1 123182 123212 02/02/2016
1 3 234256 123182 05/02/2016
2 5 986834 234256 04/19/2016
3 4 787663 787663 05/12/2016
4 20 465738 465738 03/20/2016
5 4 787223 787663 07/12/2016
You can use map
:
df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())
print (df2)
userId itemClicked ItemBought date gender
0 1 123182 123212 02/02/2016 F
1 3 234256 123182 05/02/2016 F
2 5 986834 234256 04/19/2016 M
3 4 787663 787663 05/12/2016 M
4 20 465738 465738 03/20/2016 NaN
5 4 787223 787663 07/12/2016 M
Another solution with merge
and left join, parameter on
can be omit if only column gender
is same in both DataFrames:
df = pd.merge(df2, df1, how='left')
print (df)
userId itemClicked ItemBought date gender
0 1 123182 123212 02/02/2016 F
1 3 234256 123182 05/02/2016 F
2 5 986834 234256 04/19/2016 M
3 4 787663 787663 05/12/2016 M
4 20 465738 465738 03/20/2016 NaN
5 4 787223 787663 07/12/2016 M
Timings:
#len(df2) = 600k
df2 = pd.concat([df2]*100000).reset_index(drop=True)
def f(df1,df2):
df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())
return df2
In [43]: %timeit f(df1,df2)
10 loops, best of 3: 34.2 ms per loop
In [44]: %timeit (pd.merge(df2, df1, how='left'))
10 loops, best of 3: 102 ms per loop
Add a column to pandas dataframe based on value present in different dataframe
You can use .isin()
, as follows:
A['df_b_presence'] = A['ID'].isin(B['ID'])
Result:
print(A)
ID color df_b_presence
0 5 red False
1 6 blue False
2 7 blue True
3 8 NaN False
4 9 green True
5 10 NaN True
New column based on matching values from another dataframe pandas
Check with stack
df1's list columns after re-create with DataFrame
then map
the value from df2
Also since you asking not using for loop I am using sum
, and sum
for this case is much slower than *for loop*
or itertools
s=pd.DataFrame(df1.column2.tolist()).stack()
df1['New']=s.map(df2.set_index('column3').column4).sum(level=0).apply(set)
df1
Out[36]:
column1 column2 New
0 a1 [A, B] {2, 4, 3, 1}
1 a2 [A, B, C] {3, 5, 4, 2, 1}
2 a3 [B, C] {4, 3, 1, 5}
As I mentioned and most of us suggested , also you can check with For loops with pandas - When should I care?
import itertools
d=dict(zip(df2.column3,df2.column4))
l=[set(itertools.chain(*[d[y] for y in x ])) for x in df1.column2.tolist()]
df1['New']=l
Fill column of a dataframe from another dataframe
Use drop_duplicates
with set_index
and combine_first
:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()
If need check dupes only in id
column:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()
how do I succinctly create a new dataframe column based on matching existing column values with list of values?
Use str.extract
: create a regex pattern of your search words and try to extract the matched pattern:
pattern = fr"\b({'|'.join(search_words1)})\b"
df3['col4'] = df3['col3'].str.extract(pattern)
Pattern:
>>> print(pattern)
\b(man|red)\b
\b
matches the empty string, but only at the beginning or end of a word. The ( )
is the capture group.
Related Topics
Python Selenium - Element Is Not Currently Interactable and May Not Be Manipulated
How to Determine If My Python Shell Is Executing in 32Bit or 64Bit
How to Restart a Program Based on User Input
Converting a List into Comma Separated and Add Quotes in Python
How to Overwrite the Previous Print to Stdout
How to Count the Amount of Sentences in a Paragraph in Python
How to Select Last Row and Also How to Access Pyspark Dataframe by Index
Python: Opencv - Selecting Region of an Image
Regular Expression to Check Whitespace in the Beginning and End of a String
Python: How to Print Separate Lines from a List
How to Run Python Script from Another Machine Without Installing Imported Modules
How to Select the Last Column of Dataframe
Adding Different Sized/Shaped Displaced Numpy Matrices
How to Convert Strings With Billion or Million Abbreviation into Integers in a List
Easiest Way to Replace a String Using a Dictionary of Replacements
How to Loop Over Multiple Dataframes and Produce Multiple List