Updating a Subset of a Dataframe

Modifying a subset of rows in a pandas dataframe

Use .loc for label-based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that selects the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to say exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the data and sometimes returns a view on the original object. According to the pandas documentation, this behavior depends on the underlying NumPy behavior. I've found that doing the whole selection in one operation (df.loc[rows, cols] rather than chained [one][two] indexing) is more likely to work for setting.
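As a minimal sketch of the single-step form (the frame and its values are made up for illustration), the mask can be built once and used on both sides of the assignment:

```python
import pandas as pd

# Hypothetical data, chosen only to illustrate the pattern.
df = pd.DataFrame({"A": [0, 1, 0, 2], "B": [10.0, 20.0, 30.0, 40.0]})

# Build the boolean mask once and reuse it on both sides.
mask = df["A"] == 0

# Rows and column are selected in a single .loc call, so the
# assignment targets the original frame rather than a temporary.
df.loc[mask, "B"] = df.loc[mask, "B"] / 2

# By contrast, the chained form df[mask]["B"] = ... would assign
# into the temporary object returned by df[mask].
print(df)
```

Only the rows where A is 0 are halved; the other rows keep their values.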

Updating a subset of a dataframe

df[df[,1] %in% search.df, 2] <- 100

or if you want to use column elements of the data frame directly

df$col.2[df$col.1 %in% search.df] <- 100

For clarity, here is the same thing broken down into steps:

# get index of rows to be updated by checking each value 
# in col1 against search.df => e.g. FALSE, TRUE, FALSE, ...
index <- df[,1] %in% search.df

# update col2 where index is TRUE to a new value
df[index, 2] <- 100

Efficient way to update column value for subset of rows on Pandas DataFrame?

This may be what you require:

df.loc[df.name.str.len() == 4, 'value'] *= 1000

df.loc[df.name.str.len() == 4, 'value'] = 'short_' + df['value'].astype(str)
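Here is a small runnable sketch of both updates, using invented names and values. The separate label column is my own addition, so that strings are not written into a numeric column:

```python
import pandas as pd

# Hypothetical data: names of mixed length and a numeric value.
df = pd.DataFrame({"name": ["anna", "bob", "carl", "dave"],
                   "value": [1, 2, 3, 4]})

# Rows whose name has exactly four characters.
short = df["name"].str.len() == 4

# In-place arithmetic on the selected rows only.
df.loc[short, "value"] *= 1000

# Assumption: keep the text version in its own column ("label")
# instead of overwriting the numeric column with strings.
df["label"] = df["value"].astype(str)
df.loc[short, "label"] = "short_" + df.loc[short, "label"]

print(df)
```

The right-hand side is aligned on the index, so only the rows selected by the mask receive new values.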

How to update a subset of Pandas DataFrame rows with new (different) values?

If I understand your problem correctly, you want to change the values in column C based on the values in column A, where the value assigned to C is looked up in a dictionary. Rows whose value in A is not a key of the dictionary should be left untouched.

Dictionary m is used for mapping values from column A to the target value:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4,5,6,7,8,9], 'C': [0,0,0,0,0,0,0,0,0]})
m = {1:1,3:1,6:1,8:1}

Next, build a boolean mask, select, that marks the rows whose value in A is a key of the dictionary. Then map those values of A through m and assign the result to the matching rows of column C. All other values stay as they were.

select = df['A'].isin(m.keys())
df.loc[select, 'C'] = df.loc[select, 'A'].map(m)
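The same steps can be wrapped in a small helper. This is only a sketch: the function name update_from_mapping and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd

def update_from_mapping(df, key_col, target_col, mapping):
    """Overwrite target_col (in place) where key_col is a key of mapping."""
    # Rows whose key has an entry in the mapping.
    select = df[key_col].isin(list(mapping))
    # Look up the new values for those rows only.
    df.loc[select, target_col] = df.loc[select, key_col].map(mapping)
    return df

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   "C": [0, 0, 0, 0, 0, 0, 0, 0, 0]})
m = {1: 1, 3: 1, 6: 1, 8: 1}

update_from_mapping(df, "A", "C", m)
print(df)
```

Rows where A is 1, 3, 6 or 8 now have C equal to 1, and the rest keep their original 0.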

Updating filtered data frame in pandas

EDIT: If you only need to replace missing values with values from another DataFrame, use DataFrame.fillna or DataFrame.combine_first:

df = df_1.fillna(df_2)
#alternative
#df = df_1.combine_first(df_2)

print (df)
         Name           Surname
index
R222   Katrin            Johnes
R343     John               Doe
R377   Steven           Walkins
R914    Marie  Sklodowska-Curie
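For a self-contained run, the two input frames can be rebuilt from the printed outputs in this answer. Their exact contents are an assumption, since the original definitions are not shown:

```python
import numpy as np
import pandas as pd

# Assumed inputs, reconstructed from the printed results.
index = pd.Index(["R222", "R343", "R377", "R914"], name="index")
df_1 = pd.DataFrame({"Name": ["Katrin", "John", "Steven", np.nan],
                     "Surname": ["Johnes", "Doe", "Walkins", np.nan]},
                    index=index)
df_2 = pd.DataFrame({"Name": ["Pablo", "Jarque", "Christofer", "Marie"],
                     "Surname": ["Picasso", "Berry", "Bishop",
                                 "Sklodowska-Curie"]},
                    index=index)

# Only the missing cells of df_1 are taken from df_2.
filled = df_1.fillna(df_2)

# combine_first likewise prefers df_1 and falls back to df_2.
combined = df_1.combine_first(df_2)

print(filled)
```

Both calls return a new frame and leave df_1 itself unchanged.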

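It does not work because update operates in place on the filtered subset of the DataFrame, which is a copy, so the original is left unchanged. A possible, if ugly, solution is to update the filtered DataFrame df and then add back the original rows that did not match: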

m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df = df_1[m].copy()

df.update(df_2)

df = pd.concat([df, df_1[~m]]).sort_index()
print (df)
             Name  Surname
index
R222        Pablo  Picasso
R343       Jarque    Berry
R377   Christofer   Bishop
R914          NaN      NaN

Possible solution without update:

m = (df_1["Name"].notna()) & (df_1["Surname"].notna())

df_1[m] = df_2
print (df_1)
             Name  Surname
index
R222        Pablo  Picasso
R343       Jarque    Berry
R377   Christofer   Bishop
R914          NaN      NaN
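The masked update can also be written with .loc and explicit column names. This is a sketch on input frames reconstructed from the printed outputs (an assumption, as the original definitions are not shown):

```python
import numpy as np
import pandas as pd

# Assumed inputs, reconstructed from the printed results.
index = pd.Index(["R222", "R343", "R377", "R914"], name="index")
df_1 = pd.DataFrame({"Name": ["Katrin", "John", "Steven", np.nan],
                     "Surname": ["Johnes", "Doe", "Walkins", np.nan]},
                    index=index)
df_2 = pd.DataFrame({"Name": ["Pablo", "Jarque", "Christofer", "Marie"],
                     "Surname": ["Picasso", "Berry", "Bishop",
                                 "Sklodowska-Curie"]},
                    index=index)

# Rows of df_1 where both columns are present.
m = df_1["Name"].notna() & df_1["Surname"].notna()
cols = ["Name", "Surname"]

# Overwrite only those rows; the right-hand side is aligned on
# index and columns, and the all-NaN row is left alone.
df_1.loc[m, cols] = df_2.loc[m, cols]

print(df_1)
```

Unlike the fillna variant, this replaces the rows that already have values and leaves the missing row as NaN.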

pandas python Update subset of column A based on subset of one or more other columns

You can do this more easily with pandas .loc.

Initialize dataframe:

import pandas as pd

df = pd.DataFrame({'group':['e','e','e','h','h','h'],
                   'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                   'cats':[1, 1, 1, 5, 2, 2],
                   'jobs':[1, 1, 1, 64, 64, 64],
                   'rank':[-1, -1, -1, -1, -1, -1],
                   'topvalue':[100, 0, 4, 37, 0, 3.9],
                   'freq':[1, 1, 1, 58, 63, 61]
                   })

We want to rank the rows whose feature is 'job', so we isolate the target rank cells with .loc. On the right side of the assignment we select the jobs column for the same rows with .loc and call .rank().

Rank job feature, by jobs value:

df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

Rank failure feature by frequency where top value is not 0:

For this one, your example does rank the rows where topvalue is 0, which seems to contradict what you described. So we'll do it both ways.

This way we filter out the 0s first and rank everything else. Rows with topvalue == 0 keep their rank of -1.

df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
    df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)

This way we don't filter out the 0s, so every 'fail' row is ranked.

df.loc[df.feature == 'fail', 'rank'] = (
    df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
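To compare the two variants side by side, here is a sketch that applies each one to a copy of the frame. The helper rank_into is a hypothetical name, only the columns needed here are kept, and rank is stored as float (an assumption on my part) because .rank() returns floats:

```python
import pandas as pd

df = pd.DataFrame({"feature": ["fail", "exit", "job", "exit", "fail", "job"],
                   "rank": [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0],
                   "topvalue": [100, 0, 4, 37, 0, 3.9],
                   "freq": [1, 1, 1, 58, 63, 61]})

def rank_into(frame, mask, source, target="rank", ascending=True):
    """Rank frame[source] within the masked rows and store it in target."""
    frame.loc[mask, target] = frame.loc[mask, source].rank(ascending=ascending)

fail = df["feature"] == "fail"
nonzero = df["topvalue"] != 0

# Variant 1: rows with topvalue == 0 are excluded and keep -1.
filtered = df.copy()
rank_into(filtered, fail & nonzero, "freq")

# Variant 2: every 'fail' row is ranked.
unfiltered = df.copy()
rank_into(unfiltered, fail, "freq")

print(filtered["rank"].tolist())
print(unfiltered["rank"].tolist())
```

Computing each mask once also avoids repeating the same condition on both sides of the assignment.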

Update a subset of dataframe rows and columns from another dataframe

Here's a fully atomized data.table version that updates all columns present in both data sets and, at the same time, adds the columns from df2 that are not present in df1. This updates df1 in place.

cols <- setdiff(colnames(df2), "x")
setDT(df1)[setDT(df2), (cols) := mget(paste0('i.', cols)), on = "x"]
df1
#    w x y  z
# 1: 1 a 1  3
# 2: 2 b 2  4
# 3: 3 b 2  4
# 4: 4 c 1 NA

The idea behind paste0('i.', cols) is to tell data.table to take those columns from the data.table passed in the i argument (df2), so that it knows how to handle columns that are present in both data sets.
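For readers working in pandas, a rough analogue of this update join can be sketched as follows. It is not the data.table method itself but a lookup-based substitute. The input values are invented to match the shape of the output above, and the sketch assumes the key x is unique in df2:

```python
import pandas as pd

# Hypothetical inputs shaped like the R example.
df1 = pd.DataFrame({"w": [1, 2, 3, 4],
                    "x": ["a", "b", "b", "c"],
                    "y": [0, 0, 0, 1]})
df2 = pd.DataFrame({"x": ["a", "b"], "y": [1, 2], "z": [3, 4]})

# Index df2 by the join key so each column becomes a lookup table.
lookup = df2.set_index("x")

for col in lookup.columns:
    # Value from df2 for each row of df1, NaN where the key is absent.
    new_vals = df1["x"].map(lookup[col])
    if col in df1.columns:
        # Shared column: keep df1's value where df2 has no match.
        df1[col] = new_vals.fillna(df1[col])
    else:
        # Column only in df2: add it, leaving NaN for unmatched keys.
        df1[col] = new_vals

print(df1)
```

One difference to be aware of: if a matched row of df2 holds a missing value, this sketch keeps the value from df1, whereas the data.table join would write the NA.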


Disclaimer: The idea was borrowed from @eddi's answer.

How to update a subset of a MultiIndexed pandas DataFrame

Note: In the soon-to-be-released 0.13, a drop_level argument has been added to xs (thanks to this question!):

In [42]: df.xs('sat', level='day', drop_level=False)
Out[42]:
                     sales
year flavour    day
2008 strawberry sat     10

Another option is to use select, which extracts a sub-DataFrame (a copy) of the same data. Because it keeps the same index, it can be used to update the original correctly. (DataFrame.select was deprecated in later pandas versions and removed in 1.0, so this applies to older releases only.)

In [11]: d.select(lambda x: x[2] == 'sat') * 2
Out[11]:
                     sales
year flavour    day
2008 strawberry sat     20
     banana     sat     44
2009 strawberry sat     22
     banana     sat     46

In [12]: d.update(d.select(lambda x: x[2] == 'sat') * 2)

Another option is to use an apply:

In [21]: d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)

Another option is to use get_level_values (this is probably the most efficient of these):

In [22]: d[d.index.get_level_values('day') == 'sat'] *= 2

Another option is to promote the 'day' level to a column and then use an apply.
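A self-contained sketch of the get_level_values approach, written with .loc, might look like the following. The 'sat' figures are taken from the outputs above, while the 'sun' rows are invented so that there is something left untouched:

```python
import pandas as pd

# 'sat' values follow the outputs above; 'sun' values are hypothetical.
idx = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
d = pd.DataFrame({"sales": [10, 12, 22, 24, 11, 13, 23, 25]}, index=idx)

# Boolean mask built from one level of the MultiIndex.
is_sat = d.index.get_level_values("day") == "sat"

# Double the Saturday rows in place; all other rows are unchanged.
d.loc[is_sat, "sales"] *= 2

print(d)
```

Because the mask is a plain boolean array over the rows, this works the same way for any index level.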

How to update the original dataframe after a subset (slicing) calculation?

Add an update call at the end of the for loop:

for w in ['one','two','three','four']:
    # work on an explicit copy of the rows for this group
    x = df.loc[df['a']==w].copy()
    size = x['a'].count()
    print("Records %s: %s" %(w,size))
    target_column = x.columns.get_loc('c')
    for i in range(0,size):
        # sum c over this row and the next two rows of the group
        acum = x.iloc[i:i+3,target_column].sum()
        x.loc[x.index[i],'sum_c_3'] = acum
    print (x)
    df.update(x)  # this is the line that needs to be added

df
Out[979]:
       a  b         c   sum_c_3
0    one  x  0.127171  0.210872
1    one  y -0.576157  1.212010
2    one  x  0.659859  1.788168
3    one  y  1.128309  1.128309
4    two  x  0.333521 -0.846657
5    two  y  0.753613 -1.180178
6    two  x -1.933791 -1.933791
7  three  x  0.549009  0.549009
8   four  x  0.895742  0.895742
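If the loop is not required, the same forward-looking sum can probably be computed without slicing and updating at all. The sketch below is an alternative technique, not the method used above. It reverses each group, takes an ordinary rolling sum, and reverses back. The helper name forward_sum is hypothetical, and the data is copied from the output above:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "one", "one", "one", "two", "two", "two", "three", "four"],
    "c": [0.127171, -0.576157, 0.659859, 1.128309,
          0.333521, 0.753613, -1.933791, 0.549009, 0.895742],
})

def forward_sum(s, window=3):
    """Sum of each value and the next window-1 values of the same group."""
    # Reversing turns "this row and the following ones" into a
    # regular trailing window; min_periods=1 keeps the short tails.
    reversed_sum = s.iloc[::-1].rolling(window, min_periods=1).sum()
    return reversed_sum.iloc[::-1]

# transform returns a result aligned with the original row order.
df["sum_c_3"] = df.groupby("a")["c"].transform(forward_sum)

print(df)
```

This writes straight into the original frame, so no df.update call is needed.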

