sapply + if - retain column names
The root of your issue is that you are using apply on a data frame. apply is built to work on matrices, so the first thing it does is convert your data frame to a matrix, which is unnecessary, and then the default data frame methods "fix" the column names in a way you don't like when you convert back. You may be able to fix this by adding check.names = FALSE to your as.data.frame() call, but a better approach would use lapply on a data frame and apply on a matrix, and would even work if we give it a vector input.
I'd also strongly recommend not overwriting the built-in scale function with a similar-but-different function. That could easily cause bugs. I've rewritten your function, calling it scale01() to make the distinction clear.
I also modified it so that if the input is a constant vector with missing values, only the non-missing values will be filled in with 0.5, which seems safer.
I use S3 dispatch to work appropriately based on the input class, built on a base method (scale01.numeric) that works on numeric vectors. Here it is, demonstrated on vector, data.frame, and matrix inputs:
## defining the functions
scale01 = function(x, ...) {
  UseMethod("scale01")
}
scale01.numeric = function(x, ...) {
  minx = min(x, na.rm = TRUE)
  maxx = max(x, na.rm = TRUE)
  if(minx == maxx) {
    x[!is.na(x)] = 0.5
    return(x)
  }
  (x - minx) / (maxx - minx)
}
scale01.data.frame = function(x, ...) {
  x[] = lapply(x, scale01)
  x
}
scale01.matrix = function(x, ...) {
  apply(x, MARGIN = 2, FUN = scale01)
}
## demonstrating usage
scale01(rnorm(5))
# [1] 0.0000000 1.0000000 0.4198958 0.6104154 0.2108150
scale01(mtcars[1:5, ])
#                         mpg cyl      disp        hp       drat        wt      qsec vs am gear      carb
# Mazda RX4         0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000  0  1    1 1.0000000
# Mazda RX4 Wag     0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195  0  1    1 1.0000000
# Datsun 710        1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765  1  1    1 0.0000000
# Hornet 4 Drive    0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000  1  0    0 0.0000000
# Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195  0  0    0 0.3333333
scale01(as.matrix(mtcars[1:5, ]))
#                         mpg cyl      disp        hp       drat        wt      qsec vs am gear      carb
# Mazda RX4         0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000  0  1    1 1.0000000
# Mazda RX4 Wag     0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195  0  1    1 1.0000000
# Datsun 710        1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765  1  1    1 0.0000000
# Hornet 4 Drive    0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000  1  0    0 0.0000000
# Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195  0  0    0 0.3333333
weird_name_df = data.frame(`weird column` = rnorm(5), `INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)` = rnorm(5), check.names = FALSE)
scale01(weird_name_df)
#   weird column INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)
# 1    0.6135744                                         0.2237905
# 2    0.0000000                                         0.4086837
# 3    1.0000000                                         1.0000000
# 4    0.7061441                                         0.2803262
# 5    0.7693184                                         0.0000000
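To see where the column names get altered, here is a minimal sketch that round-trips a data frame through a matrix. It assumes the conversion back goes through data.frame() with its default check.names = TRUE; the objects df, m, mangled and kept are made up for illustration and are not from the question.

```r
# Illustrative data frame with two non-syntactic column names
df <- data.frame(`weird column` = c(1, 2, 4),
                 `b-c` = c(2, 4, 8),
                 check.names = FALSE)

# apply() converts the data frame to a matrix before doing any work
m <- apply(df, MARGIN = 2, FUN = function(x) x / max(x))
colnames(m)                      # the matrix still carries the original names

# Converting back with the defaults rewrites the names to be syntactic
mangled <- data.frame(m)
names(mangled)

# check.names = FALSE leaves the names alone
kept <- data.frame(m, check.names = FALSE)
names(kept)
```

The lapply-based method above never leaves the data frame, so there is no conversion step at which the names could be rewritten.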
If you want to transform all the numeric columns of a data frame, I would suggest:
## base version
numeric_cols = sapply(your_data, is.numeric)
your_data[numeric_cols] = scale01(your_data[numeric_cols])
## dplyr version
library(dplyr)
your_data %>%
  mutate(across(where(is.numeric), scale01))
Get column names in apply function with data frame (R)
Here is how I would do it with purrr::iwalk():
purrr::iwalk(airquality, ~ message(sprintf("%s has %s cases.\nNA values: %s",
                                           .y,
                                           sum(!is.na(.x)),
                                           sum(is.na(.x)))))
Output:
Ozone has 116 cases.
NA values: 37
Solar.R has 146 cases.
NA values: 7
Wind has 153 cases.
NA values: 0
Temp has 153 cases.
NA values: 0
Month has 153 cases.
NA values: 0
Day has 153 cases.
NA values: 0
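For comparison, roughly the same idea can be sketched in base R by walking the columns and their names in parallel with mapply(). This is only an illustration: describe_column is a hypothetical helper name, and the message is put on a single line rather than two.

```r
# Hypothetical helper: builds the summary text for one column
describe_column <- function(column, name) {
  sprintf("%s has %d cases. NA values: %d",
          name, sum(!is.na(column)), sum(is.na(column)))
}

# mapply() passes each column together with its name
summaries <- mapply(describe_column, airquality, names(airquality))
print(unname(summaries))
```

Unlike iwalk(), which is called for its side effects, this version returns the messages as a character vector named by column.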
Get column name in apply function
Well, if nobody answers, you have to find out yourself... I found that you can call sapply with a vector of column indices and use each index inside the function. So the solution is:
x <- c(1,1,2,2,2,3)
y <- c(2,3,4,5,4,4)
Tb <- data.frame(x,y)
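Dq_Hist <- function(i) {
  Name <- colnames(Tb)[i]            # look up the column name from its index
  Ttl <- paste('Variable:', Name)    # paste() already inserts a single space
  hist(Tb[, i], main = Ttl, col = 'grey', xlab = Name)
}
# seq_len() is safe even if Tb has zero columns, unlike 1:ncol(Tb)
D <- sapply(seq_len(ncol(Tb)), Dq_Hist)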
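A variation on the same idea is to iterate over the column names instead of the positions, so the name is available directly. The sketch below is an assumption-laden illustration: hist_by_name is a hypothetical helper, and plot = FALSE is used so the histograms are only computed, not drawn.

```r
x <- c(1, 1, 2, 2, 2, 3)
y <- c(2, 3, 4, 5, 4, 4)
Tb <- data.frame(x, y)

# Hypothetical helper: receives a column name rather than an index
hist_by_name <- function(name, data) {
  h <- hist(data[[name]], plot = FALSE)  # compute bins and counts only
  h$xname <- name                        # label used if the object is plotted later
  h
}

# sapply() over a character vector names the result by those strings
hists <- sapply(names(Tb), hist_by_name, data = Tb, simplify = FALSE)
names(hists)
```

Each element can later be drawn with plot(), passing main and xlab built from its name.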
Access to column name of dataframe with *apply function
In this case apply is what you need. All of the data columns are of the same type and you don't have any worries about losing attributes, which is where apply causes problems. You will need to write your function differently so it just takes one vector of length 4:
fDist <- function(vec) {
  return (0.1*((vec[1] - vec[2])^2 + (vec[3]-vec[4])^2)^0.5)
}
data$f_dist <- apply(data, 1, fDist)
data
#    X1  Y1  X2  Y2    f_dist
# 1 3.5 2.1 4.1 2.9 0.1843909
# 2 3.1 1.2 0.8 4.3 0.3982462
If you wanted to use the names of the columns in 'data' then they need to be spelled correctly:
fDist <- function(vec) {
  return (0.1*((vec['X1'] - vec['X2'])^2 + (vec['Y1']-vec['Y2'])^2)^0.5)
}
data$f_dist <- apply(data, 1, fDist)
data
#--------
#    X1  Y1  X2  Y2    f_dist
# 1 3.5 2.1 4.1 2.9 0.1000000
# 2 3.1 1.2 0.8 4.3 0.3860052
Your updated (and very different) question is easy to resolve. When you use apply it coerces to the lowest common mode denominator, in this case 'character'. You have two choices: either 1) add as.numeric to all of your arguments inside the functions, or 2) only send the columns that are needed, which I will illustrate:
data2$f_dist <- apply(data2[, c("X2", "Y2")], 1, function(coords) {
  fDist2(data2[1, ]$X1, data2[1, ]$Y1, coords)
})
I really do not like how you are passing parameters to this function. Using "[" and "$" within the formals list "just looks wrong." And you should know that "df" will not be a dataframe, but rather a vector. Because it's not a dataframe (or a list) you should alter the function inside so that it uses "[" rather than "[[". Since you only want two of the coordinates, then only pass the two (numeric) ones that you would be using.
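The coercion is easy to check on a small example. The data frame below is made up for illustration (the name mixed and its label column are assumptions, not the asker's data2); it shows both options: sending only the numeric columns, and converting with as.numeric inside the function.

```r
# Made-up data: one character column next to two numeric ones
mixed <- data.frame(label = c("p", "q"),
                    X2 = c(4.1, 0.8),
                    Y2 = c(2.9, 4.3))

# apply() turns the whole data frame into one matrix, so every row is character
row_types_all <- apply(mixed, 1, class)

# Option 2: pass only the numeric columns and each row stays numeric
row_types_num <- apply(mixed[, c("X2", "Y2")], 1, class)

# Option 1: keep all columns but convert inside the function
row_sums <- apply(mixed, 1, function(row) sum(as.numeric(row[c("X2", "Y2")])))
```

Because each row arrives as a named atomic vector, single-bracket indexing by name (row["X2"]) is used inside the function.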
R - refer to column names rather than column index when using lapply with data frame
You can use sapply() as follows. The problem in this example is that you cannot easily select ranges of columns by name.
cols <- c("A", "B", "D", "F", "G", "H")
df[,cols] <- sapply(df[,cols], \(x) (5:1)[x])
The easiest way to select by a range of columns is to use eval_select() to return their positions by number. But if you do this, you might as well just use the dplyr solution. This is essentially an under-the-hood look at it.
library(tidyselect)
# expr() comes from rlang, so it is called with its namespace here
col_pos <- eval_select(rlang::expr(c(A:B, D, F:H)), df)
df[,col_pos] <- sapply(df[,col_pos], \(x) (5:1)[x])
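If adding a dependency is not wanted, a name-based range can also be approximated in base R by looking up positions with match(). This is a rough sketch under stated assumptions: the survey-style data and the helper col_range are hypothetical, and lapply is used in place of sapply so each column is replaced as a list element.

```r
# Made-up data scored from 1 to 5
df <- data.frame(A = c(1, 5, 3), B = c(2, 2, 4), C = c(3, 3, 3), D = c(5, 4, 1))

# Hypothetical helper: positions of every column from `first` to `last`
col_range <- function(data, first, last) {
  match(first, names(data)):match(last, names(data))
}

# Columns A through B, plus D
cols <- c(col_range(df, "A", "B"), match("D", names(df)))

# Reverse-score the selected columns; C is left untouched
df[cols] <- lapply(df[cols], function(x) (5:1)[x])
df
```

The helper does no validation, so a misspelled name yields NA from match() and the range fails; eval_select() reports such mistakes more clearly.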
Using lapply to set column names for a list of data frames?
It seems you want to update the original data frames. In that case, your list must be named, as in the code below.
List <- list(a = a, b = b, c = c, d = d)
list2env(lapply(List, setNames, nm = headers), globalenv())
Now if you call a you will see that it has been updated.
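Here is a small self-contained sketch of why the names matter. The data frames, the headers vector and the sandbox environment are all made up for illustration; sandbox stands in for globalenv() so that the example does not overwrite anything in your workspace.

```r
a <- data.frame(V1 = 1:2, V2 = 3:4)
b <- data.frame(V1 = 5:6, V2 = 7:8)
headers <- c("id", "value")

List <- list(a = a, b = b)

# Hypothetical target environment, used here instead of globalenv()
sandbox <- new.env()
list2env(lapply(List, setNames, nm = headers), envir = sandbox)

names(sandbox$a)   # renamed copy, stored under the list element's name
names(a)           # the original is unchanged because sandbox was the target

# Without names, list2env() has nothing to assign the elements to
unnamed <- list(a, b)
res <- try(list2env(unnamed, envir = new.env()), silent = TRUE)
inherits(res, "try-error")
```

list2env() creates one variable per list element and uses the element's name as the variable name, which is why an unnamed list cannot work.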