Create a new column with non-null columns' names
One option would be to convert the format from 'wide' to 'long' using melt. Grouped by 'a', we paste the 'variable' elements that correspond to non-zero elements in 'value' (provided as a logical condition in 'i').
melt(df, id.var='a')[value!=0,
.(z=paste(variable, collapse="_")), keyby =a]
# a z
#1: 1 b_d
#2: 2 c
#3: 3 b_c
#4: 4 b_d
#5: 5 b_d
Or instead of melting, we can group by 'a', unlist the Subset of Data.table (.SD) and paste the names of the columns that correspond to non-zero elements ('i1').
df[, {i1 <- !!unlist(.SD)
paste(names(.SD)[i1], collapse="_")} , by= a]
Benchmarks
set.seed(24)
df1 <- data.table(a=1:1e6, b = sample(0:5, 1e6,
replace=TRUE), c = sample(0:4, 1e6, replace=TRUE),
d = sample(0:3, 1e6, replace=TRUE))
akrun1 <- function() {
melt(df1, id.var='a')[value!=0,
.(z=paste(variable, collapse="_")), keyby =a]
}
akrun2 <- function() {
df1[, {i1 <- !!unlist(.SD)
paste(names(.SD)[i1], collapse="_")} , by= a]
}
ronak <- function() {
data.table(z = lapply(apply(df1, 1, function(x)
which(x[-1]!= 0)),
function(x) paste0(names(x), collapse = "_")))
}
eddi <- function(){
df1[, newcol := gsub("NA_|_NA|NA", "",
do.call(function(...) paste(..., sep = "_"),
Map(function(x, y) x[(y == 0) + 1], names(.SD), .SD)))
, .SDcols = b:d]
}
alexis = function(x)
{
ans = character(nrow(x))
for(j in seq_along(x)) {
i = x[[j]] > 0L
ans[i] = paste(ans[i], names(x)[[j]], sep = "_")
}
return(gsub("^_", "", ans))
}
system.time(akrun1())
# user system elapsed
# 22.04 0.15 22.36
system.time(akrun2())
# user system elapsed
# 26.33 0.00 26.41
system.time(ronak())
# user system elapsed
# 25.60 0.26 25.96
system.time(alexis(df1[, -1L, with = FALSE]))
# user system elapsed
# 1.92 0.06 2.09
system.time(eddi())
# user system elapsed
# 2.41 0.06 3.19
Create column by looking at non-null values in other columns
You can bfill horizontally and then select the first column:
df['new_column'] = df.bfill(axis=1).iloc[:, 0]
Output:
>>> df
A B C D E new_column
0 NaN NaN NaN NaN a a
1 b NaN NaN NaN NaN b
2 NaN NaN NaN NaN NaN NaN
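As a self-contained illustration of the same idea, here is a minimal sketch. The input frame is an assumption, rebuilt from the output shown above; the value_cols name is only there to make explicit which columns take part in the back-fill.

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the output table above.
df = pd.DataFrame(
    {
        "A": [np.nan, "b", np.nan],
        "B": [np.nan, np.nan, np.nan],
        "C": [np.nan, np.nan, np.nan],
        "D": [np.nan, np.nan, np.nan],
        "E": ["a", np.nan, np.nan],
    }
)

# Hypothetical helper name: the columns to search for a non-null value.
value_cols = ["A", "B", "C", "D", "E"]

# Back-fill along each row, so the first column ends up holding the
# first non-null value of that row (or NaN if the row is entirely null).
df["new_column"] = df[value_cols].bfill(axis=1).iloc[:, 0]
print(df)
```

Restricting the back-fill to value_cols keeps it from picking up unrelated columns if the frame has any.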
How to add a new column with a NOT NULL constraint
If the new column is supposed to be NOT NULL, add a DEFAULT clause to the column definition:
ALTER TABLE tab
ADD COLUMN col3 INTEGER NOT NULL
DEFAULT 0;
Alternatively, omit the NOT NULL, fill the new column with UPDATE, then change the column to be NOT NULL:
ALTER TABLE tab
ALTER col3 SET NOT NULL;
After an UPDATE on the whole table, you should run VACUUM (FULL) tab to get rid of the bloat.
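The effect of the DEFAULT clause can be tried out without a database server. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's sqlite3 module rather than the database the answer targets, and the col1 column and sample rows are made up. It only shows that a NOT NULL column cannot be added to a populated table unless a default supplies values for the existing rows.

```python
import sqlite3

# In-memory SQLite database used as a stand-in; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (col1 INTEGER)")
conn.executemany("INSERT INTO tab (col1) VALUES (?)", [(1,), (2,)])

# Without a DEFAULT, the existing rows would have no value for the new column,
# so the statement is rejected.
try:
    conn.execute("ALTER TABLE tab ADD COLUMN col3 INTEGER NOT NULL")
    failed_without_default = False
except sqlite3.Error:
    failed_without_default = True

# With a DEFAULT, the existing rows receive the default value.
conn.execute("ALTER TABLE tab ADD COLUMN col3 INTEGER NOT NULL DEFAULT 0")
rows = conn.execute("SELECT col1, col3 FROM tab ORDER BY col1").fetchall()
print(failed_without_default, rows)
```

The second approach from the answer (ALTER ... SET NOT NULL after an UPDATE) is not shown here because SQLite does not support that statement.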
Combining non-null values from two columns into one column
np.nan != np.nan evaluates to True, so the two commands behave differently (what happens when offer id is NaN?).
Why don't you just use fillna:
transcript_cp['offer id'].fillna(transcript_cp['offer_id'])
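A small runnable sketch of both points, assuming a made-up transcript_cp frame with the two columns named in the question (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the offer id is split across two columns.
transcript_cp = pd.DataFrame(
    {
        "offer id": ["a1", np.nan, np.nan],
        "offer_id": [np.nan, "b2", np.nan],
    }
)

# NaN never compares equal to itself, so a != comparison cannot detect it.
print(np.nan != np.nan)

# fillna takes values from the second column only where the first is missing.
combined = transcript_cp["offer id"].fillna(transcript_cp["offer_id"])
print(combined.tolist())
```

Rows where both columns are missing stay NaN in the combined result.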
How to create dataframe columns based on dictionaries for non-null columns in Python
If not every column in your dataframe is a course column, you can specify only the course column names in the courses list. Here I am just skipping the first column ('ID'):
courses = df.columns[1:]
order = ['ID'] + [col for course in courses for col in (course, course+'_Count', course+'_%')]
for course in courses:
df[course + '_Count'] = count_dict[course]
df.loc[df[course].isna(), course + '_Count'] = np.nan
df[course + '_%'] = df[course] / df[course + '_Count']
df = df[order] # reorder the columns
Result:
ID Science Science_Count Science_% Social Social_Count Social_%
0 1 12.0 30.0 0.400000 24.0 40.0 0.600
1 2 NaN NaN NaN 13.0 40.0 0.325
2 3 26.0 30.0 0.866667 NaN NaN NaN
3 4 23.0 30.0 0.766667 35.0 40.0 0.875
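For a version that runs on its own, here is a sketch with assumed inputs: the frame and count_dict are reconstructed from the result table above, not taken from the question. It also swaps the assign-then-blank-out step for Series.where, which is a variation on the approach rather than the original code.

```python
import numpy as np
import pandas as pd

# Assumed inputs, reconstructed from the result table above.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Science": [12, np.nan, 26, 23],
        "Social": [24, 13, np.nan, 35],
    }
)
count_dict = {"Science": 30, "Social": 40}

courses = list(df.columns[1:])  # every column except 'ID'
order = ["ID"] + [
    col for course in courses for col in (course, course + "_Count", course + "_%")
]

for course in courses:
    # Keep the count only on rows where the course value is present.
    counts = pd.Series(float(count_dict[course]), index=df.index).where(
        df[course].notna()
    )
    df[course + "_Count"] = counts
    df[course + "_%"] = df[course] / counts

df = df[order]  # reorder the columns
print(df)
```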
How to combine non-null entries of columns of a DataFrame into a new column?
Use DataFrame.agg to call dropna and tolist:
df.agg(lambda x: x.dropna().tolist(), axis=1)
0 [a, b]
1 [c, d, e]
2 [f, g]
dtype: object
If you need a comma-separated string instead, use str.cat or str.join:
df.agg(lambda x: x.dropna().str.cat(sep=','), axis=1)
# df.agg(lambda x: ','.join(x.dropna()), axis=1)
0 a,b
1 c,d,e
2 f,g
dtype: object
If performance is important, I recommend the use of a list comprehension:
df['output'] = [x[pd.notna(x)].tolist() for x in df.values]
df
col1 col2 col3 output
0 a NaN b [a, b]
1 c d e [c, d, e]
2 f g NaN [f, g]
This works because your DataFrame consists of strings. For more information on when loops are appropriate to use with pandas, see this discussion: For loops with pandas - When should I care?
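Put together as a standalone sketch, using the three-row frame from the output above (the as_lists and as_strings names are just for illustration):

```python
import numpy as np
import pandas as pd

# Same data as in the output table above.
df = pd.DataFrame(
    {
        "col1": ["a", "c", "f"],
        "col2": [np.nan, "d", "g"],
        "col3": ["b", "e", np.nan],
    }
)

# One pass over the underlying array: keep the non-null entries of each row.
as_lists = [row[pd.notna(row)].tolist() for row in df.values]

# Comma-separated variant built from the same lists; this relies on the
# remaining values being strings.
as_strings = [",".join(values) for values in as_lists]

print(as_lists)
print(as_strings)
```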
pyspark - assign non-null columns to new columns
You can use the SQL function greatest to extract the greatest value from a list of columns.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html
from pyspark.sql import functions as F
(df.withColumn('A', F.greatest(F.col('page_1.A'), F.col('page_2.A'), F.col('page_3.A')))
.withColumn('B', F.greatest(F.col('page_1.B'), F.col('page_2.B'), F.col('page_3.B')))
.select('userid', 'datadate', 'A', 'B'))
How to create columns by looking not null values in other columns
Use DataFrame.melt to unpivot and DataFrame.dropna to remove the missing values, then add a counter by GroupBy.cumcount and reshape by DataFrame.unstack:
df2 = df1.melt(ignore_index=False,var_name='name',value_name='val').dropna()[['val','name']]
g = df2.groupby(level=0).cumcount().add(1)
df2 = df2.set_index(g,append=True).unstack().sort_index(level=1,axis=1,sort_remaining=False)
df2.columns = df2.columns.map(lambda x: f'Other_{x[1]}_{x[0]}')
print (df2)
Other_1_val Other_1_name Other_2_val Other_2_name
0 a E NaN NaN
1 b C NaN NaN
2 c A d D
3 a B f C
5 e B a E
Finally, join to the original:
df = df1.join(df2)
print (df)
A B C D E Other_1_val Other_1_name Other_2_val Other_2_name
0 NaN NaN NaN NaN a a E NaN NaN
1 NaN NaN b NaN NaN b C NaN NaN
2 c NaN NaN d NaN c A d D
3 NaN a f NaN NaN a B f C
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN e NaN NaN a e B a E