How to avoid: read.table truncates numeric values beginning with 0

As noted in Ben's answer, colClasses is the easiest way to do this. Here is an example:

read.table(text = 'col1 col2
0012 0001245',
           header = TRUE,
           colClasses = c('character', 'numeric'))

  col1 col2
1 0012 1245  ## col1 keeps its leading zeros, but col2 does not
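If a column has already been read in as numeric, the leading zeros can be restored afterwards with formatC, provided you know the original fixed width (assumed to be 4 in this sketch):

```r
# Restore a fixed width by left-padding with zeros.
# The width of 4 is an assumption; adjust it to match the original data.
x <- c(12, 7, 1245)
padded <- formatC(x, width = 4, flag = "0")
padded
# [1] "0012" "0007" "1245"
```

This only recovers the zeros when every value originally had the same width; otherwise the information is lost at import time and colClasses is the safer route.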

Keep leading zeros with colsplit in R

We can use read.table, reading both columns as character so the leading zeros survive:

read.table(text=str, sep="~", header=FALSE, colClasses = c("character", "character"))
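Here str is the asker's input string, which is not shown; as a sketch, with a made-up two-field string using the same "~" separator:

```r
# Hypothetical input resembling the question's data (the real 'str' is not shown)
str <- "0012~0034"
read.table(text = str, sep = "~", header = FALSE,
           colClasses = c("character", "character"))
#     V1   V2
# 1 0012 0034
```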

R: why, how to avoid: read.table turns character (strings) to numeric by removing last character (colon)

With read.table, we can set colClasses using the atomic modes documented in ?vector:

The atomic modes are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw".

The issue is that, per ?read.table, when colClasses is not specified, read.table uses type.convert to automatically guess the type of each column:

Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate.
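The effect of type.convert can be seen directly: a value that looks numeric loses its leading zeros, while a value containing a non-numeric character is left alone:

```r
# "0012" looks like a number, so it is converted and the zeros are dropped
type.convert("0012", as.is = TRUE)    # the integer 12

# "0012x" does not look numeric, so it stays a character string
type.convert("0012x", as.is = TRUE)   # "0012x"
```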

The relevant code in read.table is:

...
do[1L] <- FALSE
for (i in (1L:cols)[do]) {
    data[[i]] <- if (is.na(colClasses[i]))
        type.convert(data[[i]], as.is = as.is[i], dec = dec,
                     numerals = numerals, na.strings = character(0L))
    else if (colClasses[i] == "factor")
        as.factor(data[[i]])
    else if (colClasses[i] == "Date")
        as.Date(data[[i]])
    else if (colClasses[i] == "POSIXct")
        as.POSIXct(data[[i]])
    else methods::as(data[[i]], colClasses[i])
}
...

For example, reading a file with colClasses specified explicitly:
df <- read.table(file = "df.csv",
header = TRUE,
sep = "\t",
na.strings = "NA",
quote="\"",
fileEncoding = "UTF-8",
colClasses = c("integer", "numeric", "character")
)

Checking the structure:

str(df)
'data.frame': 10 obs. of 3 variables:
$ integers: int 1 2 3 4 5 6 7 8 NA 10
$ doubles : num 1.1 2.1 3.1 4.1 5.1 6.1 7.1 NA 9.1 10.1
$ strings : chr "1." "2." "3." "4." ...

How to convert an integer 123033 to time 12:30:33 format in R

You can use str_pad from the stringr package to restore the zeroes:

library(stringr)
time_old <- "2"
time_new <- str_pad(time_old, width = 6, side = "left", pad = "0")

Then, you should be able to use the chron function:

chron::chron(times = time_new, format = list(times = "hms"),
out.format = "h:m:s")
[1] 00:00:02
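If you prefer to stay in base R, the same padding and formatting can be sketched with sprintf and substring (no chron needed; note the result here is a plain string, not a times object):

```r
t_int <- 123033                    # integer in HHMMSS form
padded <- sprintf("%06d", t_int)   # "123033"; short values are padded, e.g. 2 -> "000002"

# Split into HH, MM, SS pieces and join with ":"
formatted <- paste(substring(padded, c(1, 3, 5), c(2, 4, 6)), collapse = ":")
formatted
# [1] "12:30:33"
```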

R: reading in .csv file removes leading zeros

The read.csv, read.table, and related functions read everything in as character strings. Then, depending on arguments to the function (specifically colClasses, but also others) and options, the function tries to "simplify" the columns. If enough of a column looks numeric and you have not told the function otherwise, it will convert the column to numeric, which drops any leading 0's (and trailing 0's after the decimal). If something in the column does not look like a number, the column is not converted to numeric and is either kept as character or converted to a factor; this keeps the leading 0's. The function does not always examine the entire column when making this decision, so a column that is obviously non-numeric to you may still be converted.

The safest approach (and quickest) is to specify colClasses so that R does not need to guess (and you do not need to guess what R is going to guess).
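As a sketch, colClasses can be given as a named vector so that only the known zero-padded column (a hypothetical zip column here) is forced to character, while the remaining columns are still guessed as usual:

```r
# Inline CSV standing in for a real file; the column names are made up
d <- read.csv(text = "zip,pop\n00501,100\n90210,200",
              colClasses = c(zip = "character"))
d$zip   # "00501" "90210"  -- leading zero kept
d$pop   # 100 200          -- still converted to a number
```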

Importing data using R from SQL Server truncate leading zeros

Assuming that the underlying data in the DBMS is indeed "string"-like ...

RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE, and when false and not a clear type like "date" or "timestamp", RODBC calls type.convert which will see the number-like field and convert it to integers or numbers.

Try:

x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)

and that will stop auto-conversion of all columns.

That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:

as.is: controls conversion of character variables (insofar as they
    are not converted to logical, numeric or complex) to factors,
    if not otherwise specified by 'colClasses'. Its value is
    either a vector of logicals (values are recycled if
    necessary), or a vector of numeric or character indices which
    specify which columns should not be converted to factors.

so if you know which column (by name or column index) is being unnecessarily converted, then you can include it directly. Perhaps

## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")

## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)

(Side note: while I use select * ... on occasion as well, referring to columns by number is predicated on knowing all of the columns included in that table/query. If anything changes (perhaps it's actually a SQL view and somebody updates it, or somebody reorders the columns), then your assumption about column indices is a little fragile. All of my "production" queries in my internal packages have all columns spelled out, with no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)

If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like

qry10 <- "
select
*
from table_name
limit 10"
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0+[1-9]", z)))
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))

Caveat: I don't use RODBC, nor have I set up a temporary database with appropriately-fashioned values, so this is untested.

`read.csv()` imports text column as numeric

See colClasses in ?read.csv:

df = read.csv("data.csv", colClasses="character")

colClasses: character. A vector of classes to be assumed for the
columns. If unnamed, recycled as necessary. If named, names
are matched with unspecified values being taken to be ‘NA’.

Possible values are ‘NA’ (the default, when ‘type.convert’ is
used), ‘"NULL"’ (when the column is skipped), one of the
atomic vector classes (logical, integer, numeric, complex,
character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.


