Pandas Populate New Dataframe Column Based on Matching Columns in Another Dataframe

Pandas populate new dataframe column based on matching columns in another dataframe

Consider the following dataframes df and df2:

import pandas as pd

df = pd.DataFrame(dict(
        AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
        title=      list('zyxwvutsrqponml')
    ))

df2 = pd.DataFrame(dict(
        AUTHOR_NAME=list('AABCCEGG'),
        title      =list('zwvtrpml'),
        CATEGORY   =list('11223344')
    ))

option 1

merge

df.merge(df2, how='left')

option 2

join

cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)

Both options yield the same result: every row of df is kept, and CATEGORY is NaN wherever the (AUTHOR_NAME, title) pair has no match in df2.
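To check this yourself, here is a minimal runnable sketch of both options on the frames above. The only change from the answer is that the merge keys are passed explicitly via on=cols, which is an addition on my part and should not change the result here, since those are the only shared columns.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

cols = ['AUTHOR_NAME', 'title']

# Option 1: left merge with the key columns spelled out explicitly.
merged = df.merge(df2, how='left', on=cols)

# Option 2: join against df2 indexed by the same key columns.
joined = df.join(df2.set_index(cols), on=cols)

# Rows of df without a matching (AUTHOR_NAME, title) pair get NaN in CATEGORY.
print(merged)
print(joined)
```

Spelling out on= is mostly a readability choice: it keeps the join from silently changing if another shared column is added to either frame later.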

Pandas: Add a new column in a data frame based on a value in another data frame

print (df1)
   userId gender
0       1      F
1       2      M
2       3      F
3       4      M
4       5      M
5       6      M

print (df2)
   userId  itemClicked  ItemBought        date
0       1       123182      123212  02/02/2016
1       3       234256      123182  05/02/2016
2       5       986834      234256  04/19/2016
3       4       787663      787663  05/12/2016
4      20       465738      465738  03/20/2016
5       4       787223      787663  07/12/2016

You can use map:

df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())

print (df2)
   userId  itemClicked  ItemBought        date gender
0       1       123182      123212  02/02/2016      F
1       3       234256      123182  05/02/2016      F
2       5       986834      234256  04/19/2016      M
3       4       787663      787663  05/12/2016      M
4      20       465738      465738  03/20/2016    NaN
5       4       787223      787663  07/12/2016      M
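For reference, a self-contained sketch of the map approach, assuming only the userId, gender and itemClicked columns from the frames above. It passes the lookup Series to map directly instead of converting it to a dict, and the drop_duplicates call is my own addition as a guard, since a lookup with repeated userId values would not map cleanly.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 2, 3, 4, 5, 6],
                    'gender': list('FMFMMM')})
df2 = pd.DataFrame({'userId': [1, 3, 5, 4, 20, 4],
                    'itemClicked': [123182, 234256, 986834,
                                    787663, 465738, 787223]})

# Lookup Series: index = userId, values = gender.
# drop_duplicates keeps one row per userId so the lookup index is unique.
lookup = df1.drop_duplicates('userId').set_index('userId')['gender']

# map accepts a Series as the mapping; ids missing from df1 (here 20) become NaN.
df2['gender'] = df2['userId'].map(lookup)
print(df2)
```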

Another solution uses merge with a left join; the on parameter can be omitted if userId is the only column common to both DataFrames:

df = pd.merge(df2, df1, how='left')

print (df)
   userId  itemClicked  ItemBought        date gender
0       1       123182      123212  02/02/2016      F
1       3       234256      123182  05/02/2016      F
2       5       986834      234256  04/19/2016      M
3       4       787663      787663  05/12/2016      M
4      20       465738      465738  03/20/2016    NaN
5       4       787223      787663  07/12/2016      M
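If you prefer the merge to be explicit and to fail loudly on unexpected data, a possible variant is sketched below. The on, validate and indicator arguments are my additions and are not part of the answer above; the frames are trimmed to the columns needed for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'userId': [1, 2, 3, 4, 5, 6],
                    'gender': list('FMFMMM')})
df2 = pd.DataFrame({'userId': [1, 3, 5, 4, 20, 4],
                    'itemClicked': [123182, 234256, 986834,
                                    787663, 465738, 787223]})

out = pd.merge(
    df2, df1,
    how='left',
    on='userId',             # name the key instead of relying on shared columns
    validate='many_to_one',  # raise if userId is duplicated in df1
    indicator=True,          # add a _merge column: 'both' or 'left_only'
)
print(out)
```

The _merge column makes it easy to spot rows such as userId 20 that had no match in df1.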

Timings:

#len(df2) = 600k
df2 = pd.concat([df2]*100000).reset_index(drop=True)

def f(df1,df2):
    df2['gender'] = df2.userId.map(df1.set_index('userId')['gender'].to_dict())
    return df2


In [43]: %timeit f(df1,df2)
10 loops, best of 3: 34.2 ms per loop

In [44]: %timeit (pd.merge(df2, df1, how='left'))
10 loops, best of 3: 102 ms per loop

Add a column to pandas dataframe based on value present in different dataframe

You can use .isin(), as follows:

A['df_b_presence'] = A['ID'].isin(B['ID'])

Result:

print(A)

    ID  color  df_b_presence
0   5    red          False
1   6   blue          False
2   7   blue           True
3   8    NaN          False
4   9  green           True
5  10    NaN           True
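The input frames are not shown above, so here is a small reproduction under assumptions: A is rebuilt from the printed result, and B is a hypothetical frame whose ID column contains 7, 9 and 10.

```python
import numpy as np
import pandas as pd

# A reconstructed from the printed output above.
A = pd.DataFrame({'ID': [5, 6, 7, 8, 9, 10],
                  'color': ['red', 'blue', 'blue', np.nan, 'green', np.nan]})

# Hypothetical B: only the ID column matters for the membership test.
B = pd.DataFrame({'ID': [7, 9, 10]})

# True where A's ID also appears anywhere in B's ID column.
A['df_b_presence'] = A['ID'].isin(B['ID'])
print(A)
```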

New column based on matching values from another dataframe pandas

Expand df1's list column into a DataFrame, stack it into a single Series, then map the values from df2.

Since you asked for a solution without a for loop, I am using sum here, but note that sum is much slower in this case than a for loop or itertools.

s=pd.DataFrame(df1.column2.tolist()).stack()
# Series.sum(level=0) was removed in pandas 2.0; group on the outer index level instead
df1['New']=s.map(df2.set_index('column3').column4).groupby(level=0).sum().apply(set)
df1
Out[36]: 
  column1    column2              New
0      a1     [A, B]     {2, 4, 3, 1}
1      a2  [A, B, C]  {3, 5, 4, 2, 1}
2      a3     [B, C]     {4, 3, 1, 5}

As I mentioned, and as most of us suggested, you can also use a loop; see "For loops with pandas - When should I care?":

import itertools
d=dict(zip(df2.column3,df2.column4))


l=[set(itertools.chain(*[d[y] for y in x ])) for x in df1.column2.tolist()]
df1['New']=l
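The input frames are not shown, so the sketch below uses hypothetical data chosen to reproduce the output printed above. It also swaps in a different technique, explode followed by a per-row set union, purely as an illustration and not as the method from the answer.

```python
import pandas as pd

# Hypothetical inputs, chosen to be consistent with the output shown above.
df1 = pd.DataFrame({'column1': ['a1', 'a2', 'a3'],
                    'column2': [['A', 'B'], ['A', 'B', 'C'], ['B', 'C']]})
df2 = pd.DataFrame({'column3': ['A', 'B', 'C'],
                    'column4': [[1, 2], [3, 4], [1, 5]]})

# Lookup Series: key -> list of values.
lookup = df2.set_index('column3')['column4']

# One row per list element; the original row label is repeated in the index.
exploded = df1['column2'].explode()

# Map every key to its list, then union the lists belonging to each original row.
df1['New'] = (exploded.map(lookup)
              .groupby(level=0)
              .agg(lambda lists: set().union(*lists)))
print(df1)
```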

Fill column of a dataframe from another dataframe

Use drop_duplicates with set_index and combine_first:

df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()

If you need to check for duplicates only in the id column (Col1):

df = df2.set_index('Col1').combine_first(df1.drop_duplicates('Col1').set_index('Col1')).reset_index()
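A small worked example may help here. The frames below are hypothetical (only the Col1 name comes from the answer; Col2 and all values are made up for illustration). Values already present in df2 win, missing ones are filled from df1, and ids that exist only in df1 are added as new rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 is the source of fill values and has a duplicated id.
df1 = pd.DataFrame({'Col1': [1, 2, 2, 3],
                    'Col2': ['a', 'b', 'b', 'c']})
# df2 is the frame to fill: id 2 has a missing value, id 4 is absent from df1.
df2 = pd.DataFrame({'Col1': [1, 2, 4],
                    'Col2': ['x', np.nan, 'z']})

# Keep one row per id so the index used for alignment is unique.
filler = df1.drop_duplicates('Col1').set_index('Col1')

# df2's non-missing values take priority; gaps are filled from filler.
df = df2.set_index('Col1').combine_first(filler).reset_index()
print(df)
```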

how do I succinctly create a new dataframe column based on matching existing column values with list of values?

Use str.extract: create a regex pattern of your search words and try to extract the matched pattern:

pattern = fr"\b({'|'.join(search_words1)})\b"
df3['col4'] = df3['col3'].str.extract(pattern)

Pattern:

>>> print(pattern)
\b(man|red)\b

\b matches the empty string, but only at the beginning or end of a word. The ( ) is the capture group.
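A runnable sketch of the same idea, with two small additions of my own: re.escape, so that search words containing regex metacharacters are matched literally, and expand=False, so that str.extract returns a Series rather than a one-column DataFrame. The contents of df3 are hypothetical; search_words1 is taken from the pattern printed above.

```python
import re

import pandas as pd

search_words1 = ['man', 'red']

# Hypothetical input column.
df3 = pd.DataFrame({'col3': ['a tall man', 'the red car',
                             'manual reading', 'nothing here']})

# Escape each word, join with | and wrap in word boundaries plus one capture group.
pattern = r"\b(" + "|".join(re.escape(w) for w in search_words1) + r")\b"

# First match per row; rows without a whole-word match get NaN.
df3['col4'] = df3['col3'].str.extract(pattern, expand=False)
print(df3)
```

Because of the \b anchors, "manual" does not count as a match for "man".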
