Create a New Column with Non-Null Columns' Names


One option is to convert the format from 'wide' to 'long' with melt. Grouped by 'a', we paste together the 'variable' elements that correspond to non-zero elements of 'value' (supplied as a logical condition in 'i').

melt(df, id.var = 'a')[value != 0,
    .(z = paste(variable, collapse = "_")), keyby = a]
# a z
#1: 1 b_d
#2: 2 c
#3: 3 b_c
#4: 4 b_d
#5: 5 b_d
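The same melt-and-paste idea can be sketched in pandas; the frame below is hypothetical, reconstructed to match the output shown above:

```python
import pandas as pd

# hypothetical data matching the printed result above
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0, 1, 1, 1],
                   'c': [0, 1, 1, 0, 0],
                   'd': [1, 0, 0, 1, 1]})

# melt to long form, keep non-zero rows, then join the variable
# names per group -- the same shape as the data.table call
long = df.melt(id_vars='a')
res = (long[long['value'] != 0]
       .groupby('a')['variable']
       .agg('_'.join)
       .reset_index(name='z'))
# res['z']: b_d, c, b_c, b_d, b_d
```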

Or, instead of melting, we can group by 'a', unlist the Subset of Data.table (.SD), and paste the names of the columns that correspond to non-zero elements ('i1').

df[, {i1 <- !!unlist(.SD)
paste(names(.SD)[i1], collapse="_")} , by= a]
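A row-wise pandas sketch of the same "join the names of non-zero columns" idea, on hypothetical data matching the output shown earlier:

```python
import pandas as pd

# hypothetical data matching the printed result above
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0, 1, 1, 1],
                   'c': [0, 1, 1, 0, 0],
                   'd': [1, 0, 0, 1, 1]})

cols = ['b', 'c', 'd']
# for each row, join the names of the columns holding non-zero values
df['z'] = df[cols].apply(lambda r: '_'.join(r.index[r != 0]), axis=1)
```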

Benchmarks

set.seed(24)
df1 <- data.table(a = 1:1e6,
                  b = sample(0:5, 1e6, replace = TRUE),
                  c = sample(0:4, 1e6, replace = TRUE),
                  d = sample(0:3, 1e6, replace = TRUE))

akrun1 <- function() {
    melt(df1, id.var = 'a')[value != 0,
        .(z = paste(variable, collapse = "_")), keyby = a]
}

akrun2 <- function() {
    df1[, {i1 <- !!unlist(.SD)
           paste(names(.SD)[i1], collapse = "_")}, by = a]
}

ronak <- function() {
    data.table(z = lapply(apply(df1, 1, function(x) which(x[-1] != 0)),
                          function(x) paste0(names(x), collapse = "_")))
}

eddi <- function() {
    df1[, newcol := gsub("NA_|_NA|NA", "",
            do.call(function(...) paste(..., sep = "_"),
                Map(function(x, y) x[(y == 0) + 1], names(.SD), .SD))),
        .SDcols = b:d]
}

alexis = function(x)
{
    ans = character(nrow(x))
    for (j in seq_along(x)) {
        i = x[[j]] > 0L
        ans[i] = paste(ans[i], names(x)[[j]], sep = "_")
    }
    return(gsub("^_", "", ans))
}
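The column-wise trick in alexis's function (one vectorised pass per column, then strip the leading separator) translates directly to NumPy; the data below is hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical data; the R version receives the frame without the 'a' column
x = pd.DataFrame({'b': [1, 0, 1, 1, 1],
                  'c': [0, 1, 1, 0, 0],
                  'd': [1, 0, 0, 1, 1]})

ans = np.full(len(x), '', dtype=object)
for col in x.columns:            # one pass per column, vectorised over rows
    i = x[col].to_numpy() > 0
    ans[i] = ans[i] + '_' + col  # append this column's name where non-zero
ans = np.array([s.lstrip('_') for s in ans], dtype=object)
```

Looping over a handful of columns rather than a million rows is what makes this approach fast, in both languages.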

system.time(akrun1())
# user system elapsed
# 22.04 0.15 22.36
system.time(akrun2())
# user system elapsed
# 26.33 0.00 26.41
system.time(ronak())
# user system elapsed
# 25.60 0.26 25.96

system.time(alexis(df1[, -1L, with = FALSE]))
# user system elapsed
# 1.92 0.06 2.09

system.time(eddi())
# user system elapsed
# 2.41 0.06 3.19

Create column by looking at non-null values in other columns

You can bfill horizontally and then select the first column:

df['new_column'] = df.bfill(axis=1).iloc[:, 0]

Output:

>>> df
     A    B    C    D    E new_column
0  NaN  NaN  NaN  NaN    a          a
1    b  NaN  NaN  NaN  NaN          b
2  NaN  NaN  NaN  NaN  NaN        NaN
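For a self-contained run, the frame can be reconstructed from the printed output (values assumed):

```python
import numpy as np
import pandas as pd

# frame reconstructed from the printed output above (values assumed)
df = pd.DataFrame({'A': [np.nan, 'b', np.nan],
                   'B': [np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan],
                   'E': ['a', np.nan, np.nan]}, dtype=object)

# backward-fill along each row; the first column then holds the
# first non-null value of that row
df['new_column'] = df.bfill(axis=1).iloc[:, 0]
```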

How to add a new column with a NOT NULL constraint

If the new column is supposed to be NOT NULL, add a DEFAULT clause to the column definition:

ALTER TABLE tab
ADD COLUMN col3 INTEGER NOT NULL
DEFAULT 0;
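The default-clause behaviour can be checked quickly with SQLite from Python (table and values hypothetical; note the VACUUM (FULL) hint below is PostgreSQL-specific):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE tab (col1 INTEGER, col2 TEXT)")
con.execute("INSERT INTO tab VALUES (1, 'x')")

# adding a NOT NULL column succeeds because DEFAULT fills existing rows
con.execute("ALTER TABLE tab ADD COLUMN col3 INTEGER NOT NULL DEFAULT 0")
rows = con.execute("SELECT * FROM tab").fetchall()
# rows == [(1, 'x', 0)]
```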

Alternatively, omit the NOT NULL at first, fill the new column with UPDATE, then change the column to NOT NULL:

ALTER TABLE tab
ADD COLUMN col3 INTEGER;

UPDATE tab
SET col3 = 0;

ALTER TABLE tab
ALTER col3 SET NOT NULL;

After an UPDATE on the whole table, you should run VACUUM (FULL) tab to get rid of the bloat.

Combining non-null values from two columns into one column

np.nan != np.nan evaluates to True, so the two commands behave differently (what happens when 'offer id' is NaN?).

Why don't you just use fillna:

transcript_cp['offer id'].fillna(transcript_cp['offer_id'])
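A minimal sketch of the fillna fallback, with a hypothetical frame (only the column names come from the answer):

```python
import numpy as np
import pandas as pd

# hypothetical frame with the two partially-filled columns
transcript_cp = pd.DataFrame({'offer id': ['o1', np.nan, 'o3'],
                              'offer_id': [np.nan, 'o2', 'o3']})

# take 'offer id' and fall back to 'offer_id' where it is missing
combined = transcript_cp['offer id'].fillna(transcript_cp['offer_id'])
# combined: o1, o2, o3
```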

How to create dataframe columns based on dictionaries for non-null columns in Python

If not every column in your dataframe is a course column, you can list only the course column names in the courses list. Here I simply skip the first column ('ID'):

courses = df.columns[1:]

order = ['ID'] + [col for course in courses for col in (course, course+'_Count', course+'_%')]

for course in courses:
    df[course + '_Count'] = count_dict[course]
    df.loc[df[course].isna(), course + '_Count'] = np.nan
    df[course + '_%'] = df[course] / df[course + '_Count']

df = df[order] # reorder the columns

Result:

   ID  Science  Science_Count  Science_%  Social  Social_Count  Social_%
0   1     12.0           30.0   0.400000    24.0          40.0     0.600
1   2      NaN            NaN        NaN    13.0          40.0     0.325
2   3     26.0           30.0   0.866667     NaN           NaN       NaN
3   4     23.0           30.0   0.766667    35.0          40.0     0.875
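The loop can be run end-to-end on data reconstructed from the result shown above (the count_dict values and input frame are assumed):

```python
import numpy as np
import pandas as pd

# counts and data reconstructed from the result above (assumed)
count_dict = {'Science': 30, 'Social': 40}
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Science': [12.0, np.nan, 26.0, 23.0],
                   'Social': [24.0, 13.0, np.nan, 35.0]})

courses = df.columns[1:]
order = ['ID'] + [col for course in courses
                  for col in (course, course + '_Count', course + '_%')]

for course in courses:
    # store as float so the NaN assignment below keeps the dtype stable
    df[course + '_Count'] = float(count_dict[course])
    df.loc[df[course].isna(), course + '_Count'] = np.nan
    df[course + '_%'] = df[course] / df[course + '_Count']

df = df[order]  # reorder the columns
```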

How to combine non-null entries of columns of a DataFrame into a new column?

Use DataFrame.agg to call dropna and tolist on each row:

df.agg(lambda x: x.dropna().tolist(), axis=1)

0       [a, b]
1    [c, d, e]
2       [f, g]
dtype: object

If you need a comma-separated string instead, use str.cat or str.join:

df.agg(lambda x: x.dropna().str.cat(sep=','), axis=1)
# df.agg(lambda x: ','.join(x.dropna()), axis=1)

0      a,b
1    c,d,e
2      f,g
dtype: object

If performance is important, I recommend the use of a list comprehension:

df['output'] = [x[pd.notna(x)].tolist() for x in df.values]
df

  col1 col2 col3     output
0    a  NaN    b     [a, b]
1    c    d    e  [c, d, e]
2    f    g  NaN     [f, g]

This works because your DataFrame consists of strings. For more information on when loops are appropriate to use with pandas, see this discussion: For loops with pandas - When should I care?
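The list-comprehension version runs end-to-end on a frame reconstructed from the printed output above:

```python
import numpy as np
import pandas as pd

# frame reconstructed from the printed output above
df = pd.DataFrame({'col1': ['a', 'c', 'f'],
                   'col2': [np.nan, 'd', 'g'],
                   'col3': ['b', 'e', np.nan]})

# keep each row's non-null values; fast because it avoids apply/agg overhead
out = [x[pd.notna(x)].tolist() for x in df.values]
# out == [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
```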

pyspark - assign non-null columns to new columns

You can use the SQL function greatest to take the greatest value across a list of columns; null entries are skipped unless every input is null.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html

from pyspark.sql import functions as F

(df.withColumn('A', F.greatest(F.col('page_1.A'), F.col('page_2.A'), F.col('page_3.A')))
   .withColumn('B', F.greatest(F.col('page_1.B'), F.col('page_2.B'), F.col('page_3.B')))
   .select('userid', 'datadate', 'A', 'B'))
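The same null-skipping maximum can be sketched in pandas, since DataFrame.max ignores NaN by default; column names and values below are hypothetical stand-ins for the nested page_*.A fields:

```python
import numpy as np
import pandas as pd

# hypothetical flat columns standing in for page_1.A, page_2.A, page_3.A
df = pd.DataFrame({'page_1_A': [1.0, np.nan],
                   'page_2_A': [np.nan, 5.0],
                   'page_3_A': [3.0, np.nan]})

# like greatest(), max(axis=1) skips NaN unless the whole row is null
df['A'] = df[['page_1_A', 'page_2_A', 'page_3_A']].max(axis=1)
# df['A']: 3.0, 5.0
```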

How to create columns by looking at non-null values in other columns

Use DataFrame.melt to unpivot, remove missing values with DataFrame.dropna, then add a counter column with GroupBy.cumcount and reshape with DataFrame.unstack:

df2 = df1.melt(ignore_index=False,var_name='name',value_name='val').dropna()[['val','name']]

g = df2.groupby(level=0).cumcount().add(1)
df2 = df2.set_index(g,append=True).unstack().sort_index(level=1,axis=1,sort_remaining=False)
df2.columns = df2.columns.map(lambda x: f'Other_{x[1]}_{x[0]}')
print(df2)
  Other_1_val Other_1_name Other_2_val Other_2_name
0           a            E         NaN          NaN
1           b            C         NaN          NaN
2           c            A           d            D
3           a            B           f            C
5           e            B           a            E

Last append to original:

df = df1.join(df2)
print(df)
     A    B    C    D    E Other_1_val Other_1_name Other_2_val Other_2_name
0  NaN  NaN  NaN  NaN    a           a            E         NaN          NaN
1  NaN  NaN    b  NaN  NaN           b            C         NaN          NaN
2    c  NaN  NaN    d  NaN           c            A           d            D
3  NaN    a    f  NaN  NaN           a            B           f            C
4  NaN  NaN  NaN  NaN  NaN         NaN          NaN         NaN          NaN
5  NaN    e  NaN  NaN    a           e            B           a            E
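For a self-contained run, df1 can be reconstructed from the joined output shown above:

```python
import numpy as np
import pandas as pd

# df1 reconstructed from the joined output above
df1 = pd.DataFrame({'A': [np.nan, np.nan, 'c', np.nan, np.nan, np.nan],
                    'B': [np.nan, np.nan, np.nan, 'a', np.nan, 'e'],
                    'C': [np.nan, 'b', np.nan, 'f', np.nan, np.nan],
                    'D': [np.nan, np.nan, 'd', np.nan, np.nan, np.nan],
                    'E': ['a', np.nan, np.nan, np.nan, np.nan, 'a']})

# unpivot, drop missing, number the hits per original row, reshape wide
df2 = df1.melt(ignore_index=False, var_name='name',
               value_name='val').dropna()[['val', 'name']]
g = df2.groupby(level=0).cumcount().add(1)
df2 = (df2.set_index(g, append=True)
          .unstack()
          .sort_index(level=1, axis=1, sort_remaining=False))
df2.columns = df2.columns.map(lambda x: f'Other_{x[1]}_{x[0]}')

df = df1.join(df2)
```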

