Efficient way to update column value for subset of rows on Pandas DataFrame?
This may be what you require:
df.loc[df.name.str.len() == 4, 'value'] *= 1000
df.loc[df.name.str.len() == 4, 'value'] = 'short_' + df['value'].astype(str)
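As a minimal, self-contained sketch of the first line above: the frame below is hypothetical (the names and values are made up for illustration), and only the numeric update is shown.

```python
import pandas as pd

# Hypothetical data: 'name' and 'value' mirror the column names used above.
df = pd.DataFrame({'name': ['Anna', 'Robert', 'John'], 'value': [1, 2, 3]})

# Boolean mask: True for rows whose name has exactly 4 characters.
mask = df['name'].str.len() == 4

# Update only the masked rows of 'value'; other rows are left as they were.
df.loc[mask, 'value'] *= 1000

print(df)
```

Storing the mask in a variable avoids computing the same condition twice when it is needed on both sides of an assignment.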
pandas python Update subset of column A based on subset of one or more other columns
You can do this an easier way by using pandas .loc
Initialize dataframe:
df = pd.DataFrame({'group':['e','e','e','h','h','h'],
'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
'cats':[1, 1, 1, 5, 2, 2],
'jobs':[1, 1, 1, 64, 64, 64],
'rank':[-1, -1, -1, -1, -1, -1],
'topvalue':[100, 0, 4, 37, 0, 3.9],
'freq':[1, 1, 1, 58, 63, 61]
})
We want to rank the job feature, so we isolate the rank locations using .loc, and then on the right side of the assignment, we isolate the jobs column using .loc and use the .rank() function.
Rank job feature, by jobs value:
df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)
Rank failure feature by frequency where top value is not 0:
For this one you do rank the ones that are 0, which seems to go against what you said, so we'll do this two ways.
This way we filter out the 0s to start and rank everything else. The rows with topvalue == 0 keep their rank of -1.
df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)
This way we don't filter out the 0s.
df.loc[df.feature == 'fail', 'rank'] = (
df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
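The job ranking and the filtered fail ranking can be sketched together with named masks. This is only an illustration: it uses a reduced copy of the frame, and it assumes the rank column starts as float (-1.0) so that fractional ranks from ties would fit without changing the column dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'feature': ['fail', 'exit', 'job', 'exit', 'fail', 'job'],
    'jobs': [1, 1, 1, 64, 64, 64],
    'topvalue': [100, 0, 4, 37, 0, 3.9],
    'freq': [1, 1, 1, 58, 63, 61],
    'rank': [-1.0] * 6,  # assumption: float placeholder instead of int -1
})

# Name each condition once so both sides of the assignment use the same rows.
is_job = df['feature'] == 'job'
is_fail_nonzero = (df['feature'] == 'fail') & (df['topvalue'] != 0)

# Rank job rows by jobs, largest first.
df.loc[is_job, 'rank'] = df.loc[is_job, 'jobs'].rank(ascending=False)

# Rank fail rows with a non-zero topvalue by freq, smallest first.
df.loc[is_fail_nonzero, 'rank'] = df.loc[is_fail_nonzero, 'freq'].rank(ascending=True)

print(df[['feature', 'rank']])
```

Rows that match neither mask, including the fail row with topvalue 0, keep the placeholder -1.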
How to update a subset of Pandas DataFrame rows with new (different) values?
If I understand your problem correctly, you want to change the values in column C based on the values in column A, where the value assigned to C is looked up in a dictionary. Rows where the value in A is not present in the dictionary mapping should be left untouched.
Dictionary m is used for mapping values from column A to the target value:
df = pandas.DataFrame({'A': [1,2,3,4,5,6,7,8,9], 'C': [0,0,0,0,0,0,0,0,0]})
m = {1:1,3:1,6:1,8:1}
Then build a boolean mask, select, of all rows whose value in A matches a key of the dictionary. Map the values of column A using m and assign the result to the filtered rows of column C. The other values remain as before.
select = df['A'].isin(m.keys())
df.loc[select, 'C'] = df.loc[select, 'A'].map(m)
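To make the effect easier to see, here is a small sketch of the same pattern with a hypothetical mapping whose values differ per key (the original m maps every key to 1). It also shows why the mask matters: mapping the whole column would produce NaN for keys missing from the dictionary.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'C': [0] * 9})
m = {1: 10, 3: 30, 6: 60, 8: 80}  # hypothetical values, for illustration only

# Without a mask, unmapped keys become NaN.
unmasked = df['A'].map(m)

# With the mask, only rows whose A is a key of m are touched.
select = df['A'].isin(list(m.keys()))
df.loc[select, 'C'] = df.loc[select, 'A'].map(m)

print(df)
```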
Modifying a subset of rows in a pandas dataframe
Use .loc for label based indexing:
df.loc[df.A==0, 'B'] = np.nan
The df.A==0 expression creates a boolean series that indexes the rows, 'B' selects the column. You can also use this to transform a subset of a column, e.g.:
df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2
I don't know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to documentation here, this behavior depends on the underlying numpy behavior. I've found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
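A short sketch of the single-operation form described above, on a hypothetical frame. Column B is float here so that halving does not change its dtype.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 4.0, 6.0]})  # hypothetical data
mask = df['A'] == 0

# One .loc call selects rows and column together, so the assignment
# is applied to df itself.
df.loc[mask, 'B'] = df.loc[mask, 'B'] / 2

# The chained form, df[mask]['B'] = ..., first builds an intermediate
# object from df[mask]; assigning into that object may never reach df.

print(df)
```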
Update subset of values in a dataframe column
Could use ifelse here. Assuming data frame is named df1:
df1$x <- ifelse(df1$y %in% c("a", "b"), df1$x - 1, df1$x)
How to update column value of a data frame from another data frame matching 2 columns?
Here's a way to do it:
df1 = df1.join(df2.drop(columns='DEP ID').set_index(['Team ID', 'Group']), on=['Team ID', 'Group'])
df1.loc[df1.Result.notna(), 'Score'] = df1.Result
df1 = df1.drop(columns='Result')
Explanation:
- modify df2 so it has Team ID, Group as its index and its only column is Result
- use join to bring the new scores from df2 into a Result column in df1
- use loc to update Score values for rows where Result is not null (i.e., rows for which an updated Score is available)
- drop the Result column.
Full test code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'DEP ID':['001','001','002','002'],
'Team ID':['002','004','002','007'],
'Group':['A','A','A','A'],
'Score':[50,70,50,90]})
df2 = pd.DataFrame({
'DEP ID':['001','001','001'],
'Team ID':['002','003','004'],
'Group':['A','A','A'],
'Result':[80,60,70]})
print(df1)
print(df2)
df1 = df1.join(df2.drop(columns='DEP ID').set_index(['Team ID', 'Group']), on=['Team ID', 'Group'])
df1.loc[df1.Result.notna(), 'Score'] = df1.Result
df1 = df1.drop(columns='Result')
print(df1)
Output:
index DEP ID Team ID Group Score
0 0 001 002 A 80
1 1 001 004 A 70
2 2 002 002 A 80
3 3 002 007 A 90
UPDATE:
If the Result column in df2 is instead named Score, as asked by OP in a comment, then the code can be adjusted slightly as follows:
df1 = df1.join(df2.drop(columns='DEP ID').set_index(['Team ID', 'Group']), on=['Team ID', 'Group'], rsuffix='_NEW')
df1.loc[df1.Score_NEW.notna(), 'Score'] = df1.Score_NEW
df1 = df1.drop(columns='Score_NEW')
Update a subset of values in data.table column with values from another data.table column
You can use get to grab the i.name variable programmatically in the update join, and stay within standard data.table join operations. Example data and code:
library(data.table)
data <- data.table(snp.gene.key=1:5, dval = letters[1:5])
all_tmp <- data.table(snp.gene.key=1:3, dval=letters[11:13])
setkey(data, snp.gene.key)
setkey(all_tmp, snp.gene.key)
data
# snp.gene.key dval
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: 5 e
Then specify (name) on the LHS of the := assignment so it is interpreted rather than treated literally, along with using get on the RHS to grab the variable you want for the update join.
name <- "dval"
data[all_tmp, (name) := get(paste0("i.", name)) ]
data
# snp.gene.key dval
#1: 1 k
#2: 2 l
#3: 3 m
#4: 4 d
#5: 5 e
Updating filtered data frame in pandas
EDIT: If you need to replace only missing values with values from another DataFrame, use DataFrame.fillna or DataFrame.combine_first:
df = df_1.fillna(df_2)
#alternative
#df = df_1.combine_first(df_2)
print (df)
Name Surname
index
R222 Katrin Johnes
R343 John Doe
R377 Steven Walkins
R914 Marie Sklodowska-Curie
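The frames df_1 and df_2 are not shown in this answer, so the following sketch uses small hypothetical stand-ins to illustrate the idea: both calls keep existing values from df_1 and take values from df_2 only where df_1 has missing data.

```python
import pandas as pd

# Hypothetical stand-ins for df_1 and df_2, sharing the same index labels.
idx = pd.Index(['R222', 'R914'], name='index')
df_1 = pd.DataFrame({'Name': ['Katrin', None], 'Surname': ['Johnes', None]}, index=idx)
df_2 = pd.DataFrame({'Name': ['Pablo', 'Marie'],
                     'Surname': ['Picasso', 'Sklodowska-Curie']}, index=idx)

# fillna with a DataFrame aligns on index and columns and fills only the gaps.
filled = df_1.fillna(df_2)

# combine_first gives the same result for these frames.
combined = df_1.combine_first(df_2)

print(filled)
```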
Your approach does not work because update modifies only the filtered subset of the DataFrame in place, not the original. A possible, if ugly, solution is to update the filtered DataFrame df and then add back the original rows that did not match:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df = df_1[m].copy()
df.update(df_2)
df = pd.concat([df, df_1[~m]]).sort_index()
print (df)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
Possible solution without update:
m = (df_1["Name"].notna()) & (df_1["Surname"].notna())
df_1[m] = df_2
print (df_1)
Name Surname
index
R222 Pablo Picasso
R343 Jarque Berry
R377 Christofer Bishop
R914 NaN NaN
updating column values in pandas based on condition
There is a logic problem:
reviews = pd.DataFrame({'Score':range(6)})
print (reviews)
Score
0 0
1 1
2 2
3 3
4 4
5 5
Setting all values greater than 3 to 1 works as expected:
reviews.loc[reviews['Score'] > 3, 'Score'] = 1
print (reviews)
Score
0 0
1 1
2 2
3 3
4 1
5 1
Then all values less than or equal to 2 are set to 0, so the 1s set by reviews['Score'] > 3 are also replaced:
reviews.loc[reviews['Score'] <= 2, 'Score'] = 0
print (reviews)
Score
0 0
1 0
2 0
3 3
4 0
5 0
Finally, the rows equal to 3 are removed, leaving only 0 values:
reviews.drop(reviews[reviews['Score'] == 3].index, inplace = True)
print (reviews)
Score
0 0
1 0
2 0
4 0
5 0
You can change the solution:
reviews = pd.DataFrame({'Score':range(6)})
print (reviews)
Score
0 0
1 1
2 2
3 3
4 4
5 5
First remove 3 by keeping only the rows not equal to 3 with boolean indexing:
reviews = reviews[reviews['Score'] != 3].copy()
Then set the values to 0 and 1:
reviews['Score'] = (reviews['Score'] > 3).astype(int)
#alternative
reviews['Score'] = np.where(reviews['Score'] > 3, 1, 0)
print (reviews)
Score
0 0
1 0
2 0
4 1
5 1
EDIT1:
Your solution also works if you swap the lines: first set 0 and then 1, to avoid overwriting values:
reviews.loc[reviews['Score'] <= 2, 'Score'] = 0
reviews.loc[reviews['Score'] > 3, 'Score'] = 1
reviews.drop(reviews[reviews['Score'] == 3].index, inplace = True)
print (reviews)
Score
0 0
1 0
2 0
4 1
5 1
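Another way to sidestep the ordering problem, sketched here as an assumption rather than as part of the original answer, is to compute both masks from the original values before any assignment. The order of the two updates then no longer matters.

```python
import pandas as pd

reviews = pd.DataFrame({'Score': range(6)})

# Evaluate both conditions against the untouched column first.
high = reviews['Score'] > 3
low = reviews['Score'] <= 2

# The masks are fixed, so these two lines can run in either order.
reviews.loc[high, 'Score'] = 1
reviews.loc[low, 'Score'] = 0

# Keep only rows that matched one of the conditions (drops Score == 3).
reviews = reviews[high | low]

print(reviews)
```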