Why Does as.matrix Add Extra Spaces When Converting Numeric to Character

Why does as.matrix add extra spaces when converting numeric to character?

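This happens because of the way the as.matrix.data.frame method converts numeric columns when the data frame also contains a non-numeric column: they go through format(), which pads values to a common width. There is a simple workaround, shown below.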

Details

?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:

 ‘as.matrix’ is a generic function.  The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.

?format also notes that

Character strings are padded with blanks to the display width of the widest.

Consider this example which illustrates the behaviour

> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3

format() doesn't have to work this way, as it has a trim argument:

trim: logical; if ‘FALSE’, logical, numeric and complex values are
right-justified to a common width: if ‘TRUE’ the leading
blanks for justification are suppressed.

e.g.

> format(df[,2], trim = TRUE)
[1] "100" "90" "8"

but there is no way to pass this argument along to the as.matrix.data.frame method.
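
For a self-contained reproduction, here is a minimal sketch. The data frame df is an assumption reconstructed from the printed output above (a character column id1 and a numeric column id2); its original definition is not shown in this answer.

```r
# Hypothetical data frame, reconstructed from the output shown above
df <- data.frame(id1 = c("a", "a", "a"),
                 id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

# The character column forces a character matrix, so id2 goes through format()
m <- as.matrix(df)
padded <- m[, "id2"]
nchar(padded)    # every entry is as wide as the widest value

# Calling format() directly lets us suppress the padding
trimmed <- format(df$id2, trim = TRUE)
nchar(trimmed)
```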

Workaround

One way to work around this is to apply format() to each column yourself via sapply, where you can pass trim = TRUE:

> sapply(df, format, trim = TRUE)
     id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"

or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):

> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
     id1 id2
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"
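
One caveat, as far as I can tell from ?format: trim only affects logical, numeric and complex values, so character columns of unequal width may still be padded when passed through format(). The sketch below is one possible variant that leaves character columns alone and only formats the rest. The helper name to_char_matrix and the example data are hypothetical, not part of base R.

```r
# Hypothetical helper: build a character matrix column by column, without padding
to_char_matrix <- function(df) {
  cols <- lapply(df, function(col) {
    if (is.character(col)) return(col)             # leave strings untouched
    if (is.factor(col)) return(as.character(col))  # use the factor labels
    out <- format(col, trim = TRUE)                # numbers, logicals, ...
    is.na(out) <- is.na(col)                       # keep NA rather than the string "NA"
    out
  })
  do.call(cbind, cols)                             # column names come from the list names
}

# Example data (an assumption for illustration)
df <- data.frame(id1 = c("a", "bbb", "a"),
                 id2 = c(100, 90, NA),
                 stringsAsFactors = FALSE)
m <- to_char_matrix(df)
m
```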

How can I prevent leading spaces when transforming integer columns to character in R?

If I remember correctly, it's because of as.matrix, but you can bypass this by using sapply:

> sapply(example, format, trim = TRUE)
      a    b    c
 [1,] "1"  "1"  "foo"
 [2,] "2"  "2"  "foo"
 [3,] "3"  "3"  "foo"
 [4,] "4"  "4"  "foo"
 [5,] "5"  "5"  "foo"
 [6,] "6"  "6"  "foo"
 [7,] "7"  "7"  "foo"
 [8,] "8"  "8"  "foo"
 [9,] "9"  "9"  "foo"
[10,] "10" "10" "foo"

Why does class change from integer to character when indexing a data frame with a numeric matrix?

In ?Extract it is described that indexing via a numeric matrix is intended for matrices and arrays. So it might be surprising that such indexing worked for a data frame in the first place.

However, if we look at the code for [.data.frame (getAnywhere(`[.data.frame`)), we see that when extracting elements from a data.frame using a matrix in i, the data.frame is first coerced to a matrix with as.matrix:

function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1)
{
    # snip
    if (Narg < 3L) {
        # snip
        if (is.matrix(i))
            return(as.matrix(x)[i])

Then look at ?as.matrix:

"The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column".

Thus, because the first column in "df2" is of class character, as.matrix will coerce the entire data frame to a character matrix before the extraction takes place.
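
A small sketch of the effect. The data frame below is a hypothetical stand-in for "df2" (a character column followed by an integer column), since the original is not shown here.

```r
# Hypothetical stand-in for df2: first column character, second integer
df2 <- data.frame(name  = c("x", "y", "z"),
                  value = c(10L, 200L, 3L),
                  stringsAsFactors = FALSE)

# Two-column numeric matrix of (row, column) positions
idx <- cbind(c(1, 3), c(2, 2))

# Goes through as.matrix(df2), so the integers come back as (padded) strings
via_matrix <- df2[idx]
class(via_matrix)

# Restricting to the integer column first keeps the type
idx_one_col <- cbind(c(1, 3), c(1, 1))
kept <- as.matrix(df2["value"])[idx_one_col]
class(kept)
```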

Converting a vector of characters including the ^ operator to numeric

Using:

sapply(gsub("[<>]", "", mix), function(x) eval(parse(text = x)), USE.NAMES = FALSE)

gives:

[1]    50 10000    10   325

If you omit USE.NAMES = FALSE, you will get a named vector:

   50  10^4    10   325 
   50 10000    10   325 

It might be worthwhile to consider vapply over sapply, which is safer because it checks that every result matches the type and length declared in FUN.VALUE:

vapply(gsub("[<>]", "", mix), function(x) eval(parse(text = x)),
       FUN.VALUE = 1, USE.NAMES = FALSE)
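
To make this runnable end to end, here is a sketch with an assumed input. The vector mix below is hypothetical (the original is not shown), chosen so that it reproduces the printed result. Note that eval(parse(text = x)) evaluates arbitrary R code, so this is only reasonable for input you trust.

```r
# Hypothetical input: numbers as text, some with comparison signs or powers
mix <- c("50", "10^4", "<10", ">325")

# Strip the < and > signs, then evaluate each remaining expression
cleaned <- gsub("[<>]", "", mix)
res <- vapply(cleaned, function(x) eval(parse(text = x)),
              FUN.VALUE = numeric(1), USE.NAMES = FALSE)
res
```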

Is there a way to keep all the object types of each column in a dataframe when converting the dataframe into a matrix?

No, you can't do that. In R, a matrix has to be all one type: it is stored as a vector of that type together with an attribute saying how many rows and columns it has.

On efficiency, you're right that matrices are a lot faster than data frames. You could split your data frame into two matrices, one numeric and one character. Most other types can be coerced to one of those two without much loss.
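
A minimal sketch of that split, assuming a small made-up data frame (the column names are illustrative only):

```r
# Example data (an assumption for illustration)
df <- data.frame(id = c("a", "b", "c"),
                 x  = c(1.5, 2, 3),
                 n  = 1:3,
                 stringsAsFactors = FALSE)

# Flag the numeric columns, then build one matrix per group
is_num  <- vapply(df, is.numeric, logical(1))
num_mat <- as.matrix(df[, is_num, drop = FALSE])   # double matrix (x, n)
chr_mat <- as.matrix(df[, !is_num, drop = FALSE])  # character matrix (id)

typeof(num_mat)
typeof(chr_mat)
```

Rows stay in the same order in both matrices, so row i of num_mat still corresponds to row i of chr_mat.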

Why does `apply(df, 2, class)` show every column to be character?

apply coerces your data frame into a matrix. Try lapply instead:

lapply(df, class) 
$s
[1] "factor"

$n
[1] "numeric"
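
To see the difference side by side, here is a small sketch with an assumed data frame matching the column names above (s a factor, n numeric):

```r
# Hypothetical data frame with the same column names as in the output above
df <- data.frame(s = factor(c("a", "b", "a")),
                 n = c(1.5, 2, 3))

# apply() works on as.matrix(df), which is a character matrix here
via_apply <- apply(df, 2, class)

# Iterating over the columns themselves keeps each column's own class
# (vapply with character(1) is fine here because each class has length 1)
via_columns <- vapply(df, class, character(1))

via_apply
via_columns
```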

Why does as.factor return a character when used inside apply?

apply converts your data.frame to a character matrix. Use lapply:

lapply(a, class)
# $x1
# [1] "numeric"
# $x2
# [1] "factor"
# $x3
# [1] "factor"

In your second command, apply converts the result to a character matrix. Use lapply instead:

a2 <- lapply(a, as.factor)
lapply(a2, class)
# $x1
# [1] "factor"
# $x2
# [1] "factor"
# $x3
# [1] "factor"

For a quick look at the column types, you could use str:

str(a)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: num -1.79 -1.091 1.307 1.142 -0.972 ...
# $ x2: Factor w/ 2 levels "a","b": 2 1 1 1 2 1 1 1 1 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...

Additional explanation according to comments:

Why does the lapply work while apply doesn't?

The first thing apply does is convert its argument to a matrix, so apply(a, ...) is equivalent to apply(as.matrix(a), ...). As you can see, str(as.matrix(a)) gives you:

chr [1:100, 1:3] " 0.075124364" "-1.608618269" "-1.487629526" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "x1" "x2" "x3"

There are no more factors, so class returns "character" for all columns.

lapply works on columns so gives you what you want (it does something like class(a$column_name) for each column).

The help page for apply explains why combining apply with as.factor doesn't work:

In all cases the result is coerced by
as.vector to one of the basic vector
types before the dimensions are set,
so that (for example) factor results
will be coerced to a character array.

The help page for sapply explains why sapply with as.factor doesn't work either:

Value (...) An atomic vector or matrix
or list of the same length as X (...)
If simplification occurs, the output
type is determined from the highest
type of the return values in the
hierarchy NULL < raw < logical <
integer < real < complex < character <
list < expression, after coercion of
pairlists to lists.

You never get a matrix of factors or a data.frame back.
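
A short sketch of the apply case, using a small made-up data frame in place of a (the values are assumptions for illustration):

```r
# Hypothetical stand-in for a: one numeric column and two factor columns
a <- data.frame(x1 = c(0.5, -1.2, 2.0, 0.1),
                x2 = factor(c("a", "b", "a", "b")),
                x3 = factor(c("a", "a", "a", "b")))

# apply() coerces the factor results back to a plain character matrix
m <- apply(a, 2, as.factor)
class(m[, "x2"])

# lapply() returns a list, so each element stays a factor
l <- lapply(a, as.factor)
vapply(l, is.factor, logical(1))
```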

How to convert output to data.frame?

Simply use as.data.frame, as you wrote in your comment:

a2 <- as.data.frame(lapply(a, as.factor))
str(a2)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: Factor w/ 100 levels "-2.49629293159922",..: 60 6 7 63 45 93 56 98 40 61 ...
# $ x2: Factor w/ 2 levels "a","b": 1 1 2 2 2 2 2 1 2 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...

But if you want to replace only selected character columns with factors, there is a trick:

a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
str(a3)
# 'data.frame': 26 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d" ...
# $ x2: chr "A" "B" "C" "D" ...
# $ x3: chr "A" "B" "C" "D" ...

columns_to_change <- c("x1","x2")
a3[, columns_to_change] <- lapply(a3[, columns_to_change], as.factor)
str(a3)
# 'data.frame': 26 obs. of 3 variables:
# $ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ x3: chr "A" "B" "C" "D" ...

You could use it to replace all columns using:

a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
a3[, ] <- lapply(a3, as.factor)
str(a3)
# 'data.frame': 26 obs. of 3 variables:
# $ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ x3: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
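
If you would rather not list the columns by hand, one possible variant is to detect the character columns first. This is a sketch with a small made-up data frame; the name is_chr is only illustrative.

```r
# Example data (an assumption for illustration): two character columns, one numeric
a4 <- data.frame(x1 = c("a", "b", "c"),
                 x2 = c(1.5, 2, 3),
                 x3 = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# Find the character columns and convert only those
is_chr <- vapply(a4, is.character, logical(1))
a4[is_chr] <- lapply(a4[is_chr], as.factor)

str(a4)
```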

