Efficiently Transform Multiple Columns of a Data Frame

If the function returns a data.frame when given one (as log() does), you can operate on the selected columns directly:

cols <- c("X.1","X.2")
df[cols] <- log(df[cols])

Otherwise you will need to use lapply() or a loop over the columns. These approaches are slower than the one above, so only use them if you must.

df[cols] <- lapply(df[cols], function(x) c(NA, diff(x)))

for (col in cols) {
  df[[col]] <- c(NA, diff(df[[col]]))
}

Transform multiple columns the tidy way

We can use

df %>%
  mutate_at(vars(matches("iq")), log)

One advantage of matches() is that it can take multiple patterns in a single call. For example, if we need to apply the function to columns whose names start (^) with 'iq' or (|) end ($) with 'oq', both patterns can be passed to a single matches():

df %>%
  mutate_at(vars(matches('^iq|oq$')), log)

If the column names are completely different, so that each column would need its own pattern, but the columns of interest still sit in known positions, the column position numbers can be passed to vars() instead. In the current example, the 'iq' columns are the 1st and 2nd columns:

df %>%
  mutate_at(1:2, log)

Similarly, if the 20 columns of interest occupy the first 20 positions:

df %>%
  mutate_at(1:20, log)

Or, if the positions are 1 to 6, 8 to 12, and 41 to 50:

df %>%
  mutate_at(vars(1:6, 8:12, 41:50), log)

Convert multiple columns to string in pandas dataframe

To convert multiple columns to string, apply astype(str) to a list of columns:

df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.

That means that one way to convert all columns is to construct the list of columns like this:

all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)

Note that the latter can also be done directly, without building the list first.
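A minimal sketch of that direct form (the small example frame below is hypothetical, just to make the snippet self-contained):

import pandas as pd

# Hypothetical example frame with mixed dtypes
df = pd.DataFrame({'one': [1, 2], 'two': [3.5, 4.5], 'three': [True, False]})

# astype(str) on the whole frame converts every column to string in one step
df = df.astype(str)

print(df.dtypes)  # every column is now object (str)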

How to efficiently add multiple columns to pandas dataframe with values that depend on other columns

If you only have 50 conditions to check, it is probably better to iterate through the conditions and fill the cells in blocks rather than going through the whole frame row by row. By the way, .assign() does not just accept lambda functions, and the code can be made much more readable than in my previous suggestion. Below is a modified version that also fills the extra columns in place. If this data frame had 10,000,000 rows and I only wanted to apply different operations to 10 groups of number ranges in column A, this would be a very neat way of filling the extra columns.

import pandas as pd
import numpy as np

# Create data frame
rnd = np.random.randint(1, 10, 10)
rnd2 = np.random.randint(100, 1000, 10)
df = pd.DataFrame(
    {'A': rnd, 'B': rnd2, 'C': np.nan, 'D': np.nan, 'E': np.nan})

# Define different ways of filling the extra cells
def f1():
    return df['A'].mul(df['B'])

def f2():
    return np.log10(df['A'])

def f3():
    return df['B'] - df['A']

def f4():
    return df['A'].div(df['B'])

def f5():
    return np.sqrt(df['B'])

def f6():
    return df['A'] + df['B']

# First assign() dependent on a boolean mask
df[df['A'] < 50] = df[df['A'] < 50].assign(C = f1(), D = f2(), E = f3())

# Second assign() dependent on a boolean mask
df[df['A'] >= 50] = df[df['A'] >= 50].assign(C = f4(), D = f5(), E = f6())

print(df)

     A      B       C         D    E
0  4.0  845.0  3380.0  0.602060  841
1  3.0  967.0  2901.0  0.477121  964
2  3.0  468.0  1404.0  0.477121  465
3  2.0  548.0  1096.0  0.301030  546
4  3.0  393.0  1179.0  0.477121  390
5  7.0  741.0  5187.0  0.845098  734
6  1.0  269.0   269.0  0.000000  268
7  4.0  731.0  2924.0  0.602060  727
8  4.0  193.0   772.0  0.602060  189
9  3.0  306.0   918.0  0.477121  303
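The same block-wise filling can also be written with .loc and a boolean mask, which only touches the columns being filled instead of reassigning whole masked frames. A rough sketch, reusing df, np and the threshold of 50 from the answer above:

# Rows where A < 50 get one set of formulas, the remaining rows get another
mask = df['A'] < 50

df.loc[mask, 'C'] = df.loc[mask, 'A'] * df.loc[mask, 'B']
df.loc[mask, 'D'] = np.log10(df.loc[mask, 'A'])
df.loc[mask, 'E'] = df.loc[mask, 'B'] - df.loc[mask, 'A']

df.loc[~mask, 'C'] = df.loc[~mask, 'A'] / df.loc[~mask, 'B']
df.loc[~mask, 'D'] = np.sqrt(df.loc[~mask, 'B'])
df.loc[~mask, 'E'] = df.loc[~mask, 'A'] + df.loc[~mask, 'B']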

Converting multiple columns from character to numeric format in R

You could try

DF <- data.frame("a" = as.character(0:5),
                 "b" = paste(0:5, ".1", sep = ""),
                 "c" = letters[1:6],
                 stringsAsFactors = FALSE)

# Check columns classes
sapply(DF, class)

#           a           b           c
# "character" "character" "character"

cols.num <- c("a","b")
DF[cols.num] <- sapply(DF[cols.num],as.numeric)
sapply(DF, class)

#         a         b           c
# "numeric" "numeric" "character"

Transforming a Column into Multiple Columns according to Their Values

Source DF:

In [204]: df
Out[204]:
     Country
0      Italy
1  Indonesia
2     Canada
3      Italy

We can use pd.get_dummies():

In [205]: pd.get_dummies(df.Country)
Out[205]:
   Canada  Indonesia  Italy
0       0          0      1
1       0          1      0
2       1          0      0
3       0          0      1

Or sklearn.feature_extraction.text.CountVectorizer:

In [211]: from sklearn.feature_extraction.text import CountVectorizer

In [212]: cv = CountVectorizer()

In [213]: r = pd.SparseDataFrame(cv.fit_transform(df.Country),
                                 columns=cv.get_feature_names(),
                                 index=df.index,
                                 default_fill_value=0)

In [214]: r
Out[214]:
   canada  indonesia  italy
0       0          0      1
1       0          1      0
2       1          0      0
3       0          0      1
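Note that pd.SparseDataFrame was removed in pandas 1.0 and get_feature_names() has been replaced by get_feature_names_out() in recent scikit-learn. A rough equivalent on current versions, as a sketch assuming the same df as above, pandas >= 1.0 and scikit-learn >= 1.0:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

# Build a sparse-backed DataFrame from the CSR matrix returned by fit_transform()
r = pd.DataFrame.sparse.from_spmatrix(
    cv.fit_transform(df.Country),
    index=df.index,
    columns=cv.get_feature_names_out())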

Elegant and efficient way to replace values in multiple columns using pandas

You can clip, which replaces every value below the lower bound with that bound:

cols = ["los_24", "los_48", "in_24", "in_48"]

f[cols] = f[cols].clip(lower=1)

to get

   person_id  test_id  los_24  los_48   in_24  in_48 test
0        101      123    1.00     1.0  21.000   11.3    A
1        101      123    1.00     1.0  24.000  202.0    B
2        101      124    1.00     1.0   1.000    1.0    C
3        201      321    1.01     1.0   2.300    1.0    D
4        201      321    2.00    11.0   1.000   41.0    E
5        201      321    1.00     2.0  23.000   47.0    F
6        203      456    2.00     3.0   1.001    2.0    G
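If the replacement is not simply a lower bound, the same block replacement can be written with a masked assignment; a short sketch, assuming the same frame f and column list cols as above:

# Equivalent to clip(lower=1) here, but mask() accepts arbitrary replacement values
f[cols] = f[cols].mask(f[cols] < 1, 1)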

