Modifying a subset of rows in a pandas dataframe
Use .loc for label-based indexing:
df.loc[df.A==0, 'B'] = np.nan
The df.A==0 expression creates a boolean Series that indexes the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:
df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2
I don't know enough about pandas internals to say exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the result and sometimes a view on the original object. According to the documentation, this behavior depends on the underlying NumPy behavior. In practice, accessing everything in one operation (a single df.loc[rows, col] call) rather than chaining two lookups (df[one][two]) is more reliable when setting values.
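A minimal sketch of the difference (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 1], 'B': [10.0, 20.0, 30.0, 40.0]})

# Single .loc call: pandas knows this is an assignment into the
# original frame, so the update is applied reliably.
df.loc[df.A == 0, 'B'] = np.nan

# The chained form df[df.A == 0]['B'] = np.nan first creates an
# intermediate object that may be a copy, so the assignment can be
# silently lost (pandas flags it with SettingWithCopyWarning).
```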
Updating a subset of a dataframe
df[df[,1] %in% search.df, 2] <- 100
or if you want to use column elements of the data frame directly
df$col.2[df$col.1 %in% search.df] <- 100
For clarity, here is the same operation broken down:
# get index of rows to be updated by checking each value
# in col1 against search.df => e.g. FALSE, TRUE, FALSE, ...
index <- df[,1] %in% search.df
# update col2 where index is TRUE to a new value
df[index, 2] <- 100
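For readers coming from pandas, the counterpart of R's %in% test is Series.isin; a sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})
search_values = ['b', 'd']   # plays the role of search.df

# Boolean mask: True where col1 is found in search_values
index = df['col1'].isin(search_values)

# Update col2 to 100 where the mask is True
df.loc[index, 'col2'] = 100
```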
Efficient way to update column value for subset of rows on Pandas DataFrame?
This may be what you require; the two lines show example updates (one numeric, one string):
df.loc[df.name.str.len() == 4, 'value'] *= 1000
df.loc[df.name.str.len() == 4, 'value'] = 'short_' + df['value'].astype(str)
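A self-contained sketch of the numeric variant (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['abcd', 'abcde', 'wxyz'],
                   'value': [1, 2, 3]})

# Multiply 'value' by 1000 only where the name is exactly 4 characters long
df.loc[df.name.str.len() == 4, 'value'] *= 1000
```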
How to update a subset of Pandas DataFrame rows with new (different) values?
If I understand the problem correctly, you want to change the values in column C based on the values in column A, where the new value for C is looked up in a dictionary, while rows whose A value is not present in the dictionary mapping are left untouched.
Dictionary m is used for mapping values from column A to the target value:
df = pandas.DataFrame({'A': [1,2,3,4,5,6,7,8,9], 'C': [0,0,0,0,0,0,0,0,0]})
m = {1:1,3:1,6:1,8:1}
Then you build a boolean mask, select, marking all rows whose A value matches a key of the dictionary. Map the values of column A through m and assign the result to the filtered values of column C; the other values remain as before.
select = df['A'].isin(m.keys())
df.loc[select, 'C'] = df.loc[select, 'A'].map(m)
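Putting the pieces together as one runnable example with its result:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'C': [0, 0, 0, 0, 0, 0, 0, 0, 0]})
m = {1: 1, 3: 1, 6: 1, 8: 1}

# Rows whose A value is a key of the mapping
select = df['A'].isin(m.keys())

# Map those A values through m and write them into C;
# rows outside the mask keep their original C value (0)
df.loc[select, 'C'] = df.loc[select, 'A'].map(m)

print(df['C'].tolist())  # [1, 0, 1, 0, 0, 1, 0, 1, 0]
```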
Updating filtered data frame in pandas
EDIT: If you need to replace only the missing values from another DataFrame, use DataFrame.fillna or DataFrame.combine_first:
df = df_1.fillna(df_2)
#alternative
#df = df_1.combine_first(df_2)
print (df)
Name Surname
index
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
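The two frames used in this answer can be reconstructed from the printed outputs; a minimal setup sketch (values inferred from the results shown):

```python
import numpy as np
import pandas as pd

idx = pd.Index(['R222', 'R343', 'R377', 'R914'], name='index')
df_1 = pd.DataFrame({'Name': ['Katrin', 'John', 'Steven', np.nan],
                     'Surname': ['Johnes', 'Doe', 'Walkins', np.nan]},
                    index=idx)
df_2 = pd.DataFrame({'Name': ['Pablo', 'Jarque', 'Christofer', 'Marie'],
                     'Surname': ['Picasso', 'Berry', 'Bishop', 'Sklodowska-Curie']},
                    index=idx)

# Fill only the missing values of df_1 from df_2 (aligned by index/columns)
df = df_1.fillna(df_2)
```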
Updating the filtered DataFrame directly does not work, because update modifies a subset of the DataFrame in place. A possible (if ugly) solution is to update the filtered DataFrame df and then add back the original rows that did not match:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df = df_1[m].copy()
df.update(df_2)
df = pd.concat([df, df_1[~m]]).sort_index()
print (df)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
Possible solution without update:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df_1[m] = df_2
print (df_1)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
pandas python Update subset of column A based on subset of one or more other columns
You can do this more easily using pandas .loc.
Initialize dataframe:
df = pd.DataFrame({'group':['e','e','e','h','h','h'],
'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
'cats':[1, 1, 1, 5, 2, 2],
'jobs':[1, 1, 1, 64, 64, 64],
'rank':[-1, -1, -1, -1, -1, -1],
'topvalue':[100, 0, 4, 37, 0, 3.9],
'freq':[1, 1, 1, 58, 63, 61]
})
We want to rank the jobs feature, so we isolate the rank locations using .loc; then, on the right side of the assignment, we isolate the jobs column using .loc and apply .rank().
Rank job feature, by jobs value:
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
Rank failure feature by frequency where top value is not 0:
For this one, the expected output in the question does rank the rows where topvalue is 0, which seems to go against the stated requirement, so here it is both ways. This first way filters out the 0s up front and ranks everything else, leaving the topvalue == 0 rows at rank -1:
df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)
This way we don't filter out the 0s, so they are ranked as well:
df.loc[df.feature == 'fail', 'rank'] = df.loc[df.feature == 'fail', 'freq'].rank(ascending=True)
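Putting the frame from the question and the first ranking variant together as a runnable check (expected ranks worked out by hand):

```python
import pandas as pd

df = pd.DataFrame({'group': ['e', 'e', 'e', 'h', 'h', 'h'],
                   'feature': ['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                   'cats': [1, 1, 1, 5, 2, 2],
                   'jobs': [1, 1, 1, 64, 64, 64],
                   'rank': [-1, -1, -1, -1, -1, -1],
                   'topvalue': [100, 0, 4, 37, 0, 3.9],
                   'freq': [1, 1, 1, 58, 63, 61]})

# Rank the 'job' rows by jobs value, highest first:
# jobs=64 gets rank 1, jobs=1 gets rank 2
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

# Rank the 'fail' rows by freq, skipping rows where topvalue == 0;
# only the row with topvalue=100 qualifies, so it gets rank 1
mask = (df.feature == 'fail') & (df.topvalue != 0)
df.loc[mask, 'rank'] = df.loc[mask, 'freq'].rank(ascending=True)
```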
Update a subset of dataframe rows and columns from another dataframe
Here's a fully atomized data.table version that updates all columns present in both data sets and simultaneously adds the columns from df2 that are not present in df1. This will update df1 in place:
cols <- setdiff(colnames(df2), "x")
setDT(df1)[setDT(df2), (cols) := mget(paste0('i.', cols)), on = "x"]
df1
# w x y z
# 1: 1 a 1 3
# 2: 2 b 2 4
# 3: 3 b 2 4
# 4: 4 c 1 NA
The idea behind paste0('i.', cols) is to tell data.table to take the columns from the data.table in the i position (df2), so it knows how to handle the columns that are present in both data sets.
Disclaimer: The idea was borrowed from this @eddi's answer
How to update a subset of a MultiIndexed pandas DataFrame
Note: in the soon-to-be-released 0.13, a drop_level argument has been added to xs (thanks to this question!):
In [42]: df.xs('sat', level='day', drop_level=False)
Out[42]:
sales
year flavour day
2008 strawberry sat 10
Another option is to use select (which extracts a sub-DataFrame (copy) of the same data, i.e. it has the same index and so can be updated correctly):
In [11]: d.select(lambda x: x[2] == 'sat') * 2
Out[11]:
sales
year flavour day
2008 strawberry sat 20
banana sat 44
2009 strawberry sat 22
banana sat 46
In [12]: d.update(d.select(lambda x: x[2] == 'sat') * 2)
Another option is to use an apply:
In [21]: d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)
Another option is to use get_level_values (this is probably the most efficient of these):
In [22]: d[d.index.get_level_values('day') == 'sat'] *= 2
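A self-contained sketch of the get_level_values approach; the frame is reconstructed to match the outputs above, with one invented non-sat row to show it is left untouched:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2008, 'strawberry', 'sat'), (2008, 'banana', 'sat'),
     (2009, 'strawberry', 'sat'), (2009, 'banana', 'sat'),
     (2008, 'strawberry', 'sun')],
    names=['year', 'flavour', 'day'])
d = pd.DataFrame({'sales': [10, 22, 11, 23, 7]}, index=idx)

# A boolean mask over the 'day' level selects the rows to scale in place
d[d.index.get_level_values('day') == 'sat'] *= 2
```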
Another option is promote the 'day' level to a column and then use an apply.
How to update the original dataframe after a subset (slicing) calculation?
Add df.update(x) at the end of the for loop:
for w in ['one','two','three','four']:
    x = df.loc[df['a']==w].copy()   # work on a copy to avoid SettingWithCopyWarning
    size = x.iloc[:]['a'].count()
    print("Records %s: %s" %(w,size))
    target_column = x.columns.get_loc('c')
    for i in range(0,size):
        idx = x.index
        acum = x.iloc[i:i+3,target_column].sum()
        x.loc[x.loc[idx,'sum_c_3'].index[i],'sum_c_3'] = acum
    print (x)
    df.update(x)  # this is the line that needs to be added
df
Out[979]:
a b c sum_c_3
0 one x 0.127171 0.210872
1 one y -0.576157 1.212010
2 one x 0.659859 1.788168
3 one y 1.128309 1.128309
4 two x 0.333521 -0.846657
5 two y 0.753613 -1.180178
6 two x -1.933791 -1.933791
7 three x 0.549009 0.549009
8 four x 0.895742 0.895742
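For reference, the forward-looking 3-row sum computed inside the loop can also be expressed without Python-level loops, by reversing each group, applying a trailing rolling sum, and reversing back (a sketch, not the answer's original method; data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'one', 'one', 'two', 'two'],
                   'c': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Within each group, sum_c_3[i] = c[i] + c[i+1] + c[i+2] (partial at the end);
# reversing turns the forward window into a standard trailing window.
df['sum_c_3'] = (df.groupby('a')['c']
                   .transform(lambda s: s[::-1].rolling(3, min_periods=1).sum()[::-1]))
```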