How to avoid: read.table truncates numeric values beginning with 0
As said in Ben's answer, colClasses is the easiest way to do it. Here is an example:
read.table(text = 'col1 col2
0012 0001245',
  header = TRUE,
  colClasses = c('character', 'numeric'))
  col1 col2
1 0012 1245 ## col1 keeps its leading zeros, col2 does not
Keep leading zeros with colsplit in R
We can use read.table and read every column as character (here str holds the input text):
read.table(text=str, sep="~", header=FALSE, colClasses = c("character", "character"))
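Since the input text is not shown above, here is a minimal self-contained sketch of the same call. The variable raw_text and its contents are assumptions made up for illustration; a single unnamed colClasses value is recycled across all columns.

```r
# Hypothetical input: two "~"-separated fields with leading zeros
raw_text <- "0012~0001245\n0100~0000077"

# One unnamed class is recycled, so every column is read as character
parts <- read.table(text = raw_text, sep = "~", header = FALSE,
                    colClasses = "character")

parts$V1  # leading zeros preserved
parts$V2
```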
R: why, how to avoid: read.table turns character (strings) to numeric by removing last character (colon)
With read.table, we can specify colClasses using the atomic modes listed in ?vector:
The atomic modes are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw".
The issue is that, when colClasses is not specified, read.table uses type.convert to judge the type of each column automatically. From the colClasses entry in ?read.table:
Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate.
The relevant code in read.table would be:
...
do[1L] <- FALSE
for (i in (1L:cols)[do]) {
    data[[i]] <- if (is.na(colClasses[i]))
        type.convert(data[[i]], as.is = as.is[i], dec = dec,
                     numerals = numerals, na.strings = character(0L))
    else if (colClasses[i] == "factor")
        as.factor(data[[i]])
    else if (colClasses[i] == "Date")
        as.Date(data[[i]])
    else if (colClasses[i] == "POSIXct")
        as.POSIXct(data[[i]])
    else methods::as(data[[i]], colClasses[i])
}
...
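To see the effect of that type.convert branch in isolation, the following sketch calls type.convert directly on character vectors. The sample values are assumptions chosen for illustration and are not taken from the original question.

```r
# Assumed sample values that look numeric but carry leading zeros
x <- c("0012", "0001245")

# This is the conversion read.table applies when colClasses is NA
converted <- type.convert(x, as.is = TRUE)
class(converted)  # the vector is now integer
converted         # the leading zeros are gone

# One non-numeric entry is enough to keep the vector as character
kept <- type.convert(c("0012", "00A1"), as.is = TRUE)
class(kept)
```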
df <- read.table(file = "df.csv",
header = TRUE,
sep = "\t",
na.strings = "NA",
quote="\"",
fileEncoding = "UTF-8",
colClasses = c("integer", "numeric", "character")
)
Checking the structure:
str(df)
'data.frame': 10 obs. of 3 variables:
$ integers: int 1 2 3 4 5 6 7 8 NA 10
$ doubles : num 1.1 2.1 3.1 4.1 5.1 6.1 7.1 NA 9.1 10.1
$ strings : chr "1." "2." "3." "4." ...
How to convert an integer 123033 to time 12:30:33 format in R
You can use str_pad from the stringr package to restore the zeroes:
library(stringr)
time_old <- "2"
time_new <- str_pad(time_old, width = 6, side = "left", pad = "0")
Then, you should be able to use the chron function:
chron::chron(times = time_new, format = list(times = "hms"),
out.format = "h:m:s")
[1] 00:00:02
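If installing stringr and chron is not an option, a base R sketch along the following lines may be enough when only a formatted string is needed. This is a swapped-in alternative, not the approach above; it assumes the input is an integer of at most six digits, and the variable names are made up for illustration.

```r
# Assumed input: a time stored as an integer in HHMMSS form
time_int <- 123033L

# Left-pad with zeros to six digits (2L would become "000002")
padded <- sprintf("%06d", time_int)

# Insert colons between the hour, minute, and second pairs
time_chr <- sub("^([0-9]{2})([0-9]{2})([0-9]{2})$", "\\1:\\2:\\3", padded)
time_chr
```

The result is a plain character string rather than a chron times object, so it suits display more than arithmetic.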
R: reading in .csv file removes leading zeros
The read.csv, read.table, and related functions read everything in as character strings; then, depending on the arguments to the function (specifically colClasses, but also others) and on options, they try to "simplify" the columns. If enough of the column looks numeric and you have not told the function otherwise, it converts the column to numeric, which drops any leading 0's (and trailing 0's after the decimal point). If something in the column does not look like a number, the column is not converted to numeric; it is either kept as character or converted to a factor, and either way the leading 0's survive. The function does not always look at the entire column to make the decision, so a column that is obviously non-numeric to you may still be converted.
The safest approach (and quickest) is to specify colClasses so that R does not need to guess (and you do not need to guess what R is going to guess).
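As a small illustration of the difference, the sketch below reads the same CSV text twice, once letting R guess and once with colClasses given by column name. The column names and values are assumptions invented for the example.

```r
# Hypothetical CSV content with an ID column that has leading zeros
csv_text <- "id,amount\n00123,1.50\n00456,2.25"

# Default behaviour: R guesses, and id becomes an integer column
guessed <- read.csv(text = csv_text)
guessed$id

# Explicit classes: id stays character, amount is numeric
explicit <- read.csv(text = csv_text,
                     colClasses = c(id = "character", amount = "numeric"))
explicit$id
```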
Importing data using R from SQL Server truncate leading zeros
Assuming that the underlying data in the DBMS is indeed "string"-like ...
RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE, and when it is FALSE and the column is not a clear type like "date" or "timestamp", RODBC calls type.convert, which will see the number-like field and convert it to integers or numbers.
Try:
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)
and that will stop auto-conversion of all columns.
That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:
as.is: controls conversion of character variables (insofar as they
are not converted to logical, numeric or complex) to factors,
if not otherwise specified by 'colClasses'. Its value is
either a vector of logicals (values are recycled if
necessary), or a vector of numeric or character indices which
specify which columns should not be converted to factors.
So if you know which column (by name or column index) is being unnecessarily converted, you can specify it directly. Perhaps:
## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")
## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)
(Side note: while I use select * ... on occasion as well, the presumption of knowing columns by number is predicated on knowing all of the columns included in that table/query. If anything changes, perhaps because it's actually a SQL view and somebody updates it, or because somebody changes the order of columns, then your assumptions about column indices are a little fragile. All of my "production" queries in my internal packages have all columns spelled out, with no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)
If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like
qry10 <- "
select top 10
*
from table_name"
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0+[1-9]", z)))
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))
Caveat: I don't use RODBC, nor have I set up a temporary database with appropriately-fashioned values, so this is untested.
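The detection step itself does not need a database to try out. The sketch below applies the same grepl test to a local data frame that stands in for the as.is = TRUE result; the data frame and its column names are assumptions made up for illustration.

```r
# Stand-in for the first few rows fetched with as.is = TRUE (all character)
sample_rows <- data.frame(
  zip = c("02134", "00501"),
  qty = c("10", "25"),
  stringsAsFactors = FALSE
)

# TRUE for columns where every value starts with one or more zeros
leadzero <- sapply(sample_rows, function(z) all(grepl("^0+[1-9]", z)))

# Indices (with names) of the columns to pass as as.is
which(leadzero)
```

Note that a column where only some values have leading zeros is not flagged, because of the all() call; swapping in any() would be the looser choice.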
`read.csv()` imports text column as numeric
See colClasses in ?read.csv:
df = read.csv("data.csv", colClasses="character")
colClasses: character. A vector of classes to be assumed for the
columns. If unnamed, recycled as necessary. If named, names
are matched with unspecified values being taken to be ‘NA’.

Possible values are ‘NA’ (the default, when ‘type.convert’ is
used), ‘"NULL"’ (when the column is skipped), one of the
atomic vector classes (logical, integer, numeric, complex,
character, raw), or ‘"factor"’, ‘"Date"’ or ‘"POSIXct"’.
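The named form and the "NULL" value described in that help text can be combined, as in the following sketch. The CSV content and column names are assumptions invented for the example; the unnamed value column falls back to NA and is therefore guessed by type.convert.

```r
# Hypothetical CSV: a zero-padded code, a column to drop, and a number
csv_text <- "code,note,value\n007,skip me,3.5\n042,skip me too,4.0"

# code is kept as text, note is skipped, value is left for R to guess
df <- read.csv(text = csv_text,
               colClasses = c(code = "character", note = "NULL"))

str(df)
```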