Why does as.matrix add extra spaces when converting numeric to character?
This is because of the way columns are converted in the as.matrix.data.frame method when the data frame contains non-numeric data. There is a simple workaround, shown below.
Details
?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
?format also notes that:
Character strings are padded with blanks to the display width of the widest.
Consider this example, which illustrates the behaviour:
> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3
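The data frame df is not shown above. As a minimal sketch, assuming it has a character column id1 and a numeric column id2 (the values come from the output shown; the construction itself is an assumption), the padding can be reproduced end to end through as.matrix:

```r
# Hypothetical reconstruction of df; the original definition is not shown
df <- data.frame(id1 = c("a", "a", "a"),
                 id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

# id1 is character, so as.matrix() returns a character matrix and
# passes the numeric column through format(), which pads to a common width
m <- as.matrix(df)
padded <- m[, "id2"]

print(padded)
print(nchar(padded))  # every element has the same width
```

Because every element of the numeric column ends up with the same number of characters, the shorter values carry leading blanks.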
format doesn't have to work this way, as it has a trim argument:
trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.
e.g.
> format(df[,2], trim = TRUE)
[1] "100" "90" "8"
but there is no way to pass this argument along to the as.matrix.data.frame method.
Workaround
A way to work around this is to apply format() yourself, manually, via sapply. There you can pass in trim = TRUE:
> sapply(df, format, trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
or, using vapply, we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):
> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
How can I prevent leading spaces when transforming integer columns to character in R?
If I remember correctly, it's because of as.matrix, but you can bypass this by using sapply:
> sapply(example, format, trim = TRUE)
a b c
[1,] "1" "1" "foo"
[2,] "2" "2" "foo"
[3,] "3" "3" "foo"
[4,] "4" "4" "foo"
[5,] "5" "5" "foo"
[6,] "6" "6" "foo"
[7,] "7" "7" "foo"
[8,] "8" "8" "foo"
[9,] "9" "9" "foo"
[10,] "10" "10" "foo"
Why does class change from integer to character when indexing a data frame with a numeric matrix?
In ?Extract it is described that indexing via a numeric matrix is intended for matrices and arrays. So it might be surprising that such indexing worked for a data frame in the first place.
However, if we look at the code for [.data.frame (getAnywhere(`[.data.frame`)), we see that when extracting elements from a data.frame using a matrix in i, the data.frame is first coerced to a matrix with as.matrix:
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) ==
1)
{
# snip
if (Narg < 3L) {
# snip
if (is.matrix(i))
return(as.matrix(x)[i])
Then look at ?as.matrix:
"The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column".
Thus, because the first column in "df2" is of class character, as.matrix will coerce the entire data frame to a character matrix before the extraction takes place.
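The original "df2" is not shown here. A minimal sketch of the effect, using hypothetical data frames (df2 with a character first column, df3 with integer columns only), contrasts the two cases:

```r
# Hypothetical data: the column names and values are illustrative only
df2 <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = FALSE)
df3 <- data.frame(x = 1:2, y = 3:4)

# Two-column numeric matrix: each row is a (row, column) pair
idx <- cbind(c(1, 2), c(2, 2))

mixed   <- df2[idx]  # goes through as.matrix(df2), a character matrix
all_int <- df3[idx]  # goes through as.matrix(df3), an integer matrix

print(class(mixed))
print(class(all_int))
```

To keep the integer type for the mixed data frame, one option is to index the column directly, for example df2$y[c(1, 2)], so that no matrix conversion takes place.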
Converting a vector of characters including the ^ operator to numeric
Using:
sapply(gsub("[<>]", "", mix), function(x) eval(parse(text = x)), USE.NAMES = FALSE)
gives:
[1] 50 10000 10 325
If you omit USE.NAMES = FALSE, you will get a named vector:
50 10^4 10 325
50 10000 10 325
It might be worthwhile to consider vapply over sapply, which is safer because the expected type and length of each result is declared up front:
vapply(gsub("[<>]", "", mix), function(x) eval(parse(text = x)),
FUN.VALUE = 1, USE.NAMES = FALSE)
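The input vector mix is not shown above. As a self-contained sketch, assuming mix holds strings such as the ones below (hypothetical values chosen to match the printed result), the whole conversion runs like this:

```r
# Hypothetical input; the original definition of mix is not shown
mix <- c("50", "10^4", "<10", ">325")

# Strip the comparison signs, then evaluate each string as an R expression.
# FUN.VALUE = numeric(1) makes vapply fail loudly if any string does not
# evaluate to a single number.
cleaned <- gsub("[<>]", "", mix)
nums <- vapply(cleaned,
               function(x) eval(parse(text = x)),
               FUN.VALUE = numeric(1),
               USE.NAMES = FALSE)

print(nums)
```

Note that eval(parse(text = ...)) runs whatever code the strings contain, so this approach is only appropriate for input you trust.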
Is there a way to keep all the object types of each column in a dataframe when converting the dataframe into a matrix?
No, you can't do that. In R, a matrix has to be all one type: it is stored as a vector of that type together with an attribute saying how many rows and columns it has.
For efficiency, you're right that matrices are a lot faster than dataframes. Maybe you can split your dataframe into one numeric one and one character one. Most other types can be coerced to those types without much loss.
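A minimal sketch of the split suggested above, using a hypothetical data frame d (all names here are illustrative), could look like this:

```r
# Hypothetical data frame with one numeric and one character column
d <- data.frame(n = c(1.5, 2), s = c("a", "b"), stringsAsFactors = FALSE)

# Flag numeric columns, then build one matrix per type
is_num  <- vapply(d, is.numeric, logical(1))
num_mat <- as.matrix(d[is_num])    # all-numeric input keeps its numeric type
chr_mat <- as.matrix(d[!is_num])   # character columns give a character matrix

print(typeof(num_mat))
print(typeof(chr_mat))
```

Each matrix holds a single type, so the numeric part is not turned into strings by the presence of the character column.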
Why does `apply(df, 2, class)` show every column to be character?
apply coerces your data frame into a matrix. Try lapply instead:
lapply(df, class)
$s
[1] "factor"
$n
[1] "numeric"
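The data frame from the question is not shown. As a sketch, assuming a data frame with a factor column s and a numeric column n (hypothetical values), the difference between the two approaches can be reproduced as follows:

```r
# Hypothetical data frame matching the column names in the output above
df <- data.frame(s = factor(c("a", "b", "a")), n = c(1.5, 2, 3.25))

# apply() first converts df to a (character) matrix, so class() only
# ever sees character vectors
via_apply <- apply(df, 2, class)

# vapply() works column by column on the original data frame;
# class(col)[1] keeps one value per column even for multi-class objects
via_vapply <- vapply(df, function(col) class(col)[1], character(1))

print(via_apply)
print(via_vapply)
```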
Why does as.factor return a character when used inside apply?
apply converts your data.frame to a character matrix. Use lapply:
lapply(a, class)
# $x1
# [1] "numeric"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
In your second command, apply converts the result to a character matrix. Use lapply instead:
a2 <- lapply(a, as.factor)
lapply(a2, class)
# $x1
# [1] "factor"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
For a quick look at the column types, you could use str:
str(a)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: num -1.79 -1.091 1.307 1.142 -0.972 ...
# $ x2: Factor w/ 2 levels "a","b": 2 1 1 1 2 1 1 1 1 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
Additional explanation according to comments:
Why does the lapply work while apply doesn't?
The first thing that apply does is to convert its argument to a matrix. So apply(a, ...) is equivalent to apply(as.matrix(a), ...). As you can see, str(as.matrix(a)) gives you:
chr [1:100, 1:3] " 0.075124364" "-1.608618269" "-1.487629526" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "x1" "x2" "x3"
There are no more factors, so class returns "character" for all columns. lapply works on columns, so it gives you what you want (it does something like class(a$column_name) for each column).
You can see in the help for apply why apply and as.factor don't work together:
In all cases the result is coerced by
as.vector to one of the basic vector
types before the dimensions are set,
so that (for example) factor results
will be coerced to a character array.
Why sapply and as.factor don't work together is explained in the help for sapply:
Value (...) An atomic vector or matrix
or list of the same length as X (...)
If simplification occurs, the output
type is determined from the highest
type of the return values in the
hierarchy NULL < raw < logical <
integer < real < complex < character <
list < expression, after coercion of
pairlists to lists.
You never get a matrix of factors or a data.frame.
How to convert the output to a data.frame?
Simple, use as.data.frame as you wrote in the comment:
a2 <- as.data.frame(lapply(a, as.factor))
str(a2)
'data.frame': 100 obs. of 3 variables:
$ x1: Factor w/ 100 levels "-2.49629293159922",..: 60 6 7 63 45 93 56 98 40 61 ...
$ x2: Factor w/ 2 levels "a","b": 1 1 2 2 2 2 2 1 2 2 ...
$ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
But if you want to replace selected character columns with factors, there is a trick:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: chr "a" "b" "c" "d" ...
$ x2: chr "A" "B" "C" "D" ...
$ x3: chr "A" "B" "C" "D" ...
columns_to_change <- c("x1","x2")
a3[, columns_to_change] <- lapply(a3[, columns_to_change], as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: chr "A" "B" "C" "D" ...
You could use it to replace all columns using:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
a3[, ] <- lapply(a3, as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
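A variation on the same trick, sketched here with a hypothetical data frame a4, picks the character columns automatically instead of listing them by name:

```r
# Hypothetical data frame with two character columns and one integer column
a4 <- data.frame(x1 = letters[1:3], x2 = 1:3, x3 = LETTERS[1:3],
                 stringsAsFactors = FALSE)

# Flag the character columns and convert only those to factors
is_chr <- vapply(a4, is.character, logical(1))
a4[is_chr] <- lapply(a4[is_chr], as.factor)

col_classes <- vapply(a4, function(col) class(col)[1], character(1))
print(col_classes)
```

The integer column is left untouched, which avoids turning numeric data into factors by accident.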