Export UTF-8 BOM to .csv in R

The help page for Encoding (help("Encoding")) describes a special encoding, "bytes".

Using this, I was able to generate the CSV file as follows:

v <- "נווה שאנן"
X <- data.frame(v1 = rep(v, 3), v2 = LETTERS[1:3], v3 = 0, stringsAsFactors = FALSE)

# Mark the strings as raw bytes so write.csv() writes them out verbatim
# instead of translating them to the native encoding.
Encoding(X$v1) <- "bytes"
write.csv(X, "test.csv", row.names = FALSE)

Take care with the difference between factor and character columns. The following should work:

# Find character columns containing UTF-8-marked strings and switch them
# to "bytes". any() is needed because Encoding() returns one value per element.
id_characters <- which(sapply(X, function(x)
  is.character(x) && any(Encoding(x) == "UTF-8")))
for (i in id_characters) Encoding(X[[i]]) <- "bytes"

# Do the same for factor columns, whose strings live in the levels.
id_factors <- which(sapply(X, function(x)
  is.factor(x) && any(Encoding(levels(x)) == "UTF-8")))
for (i in id_factors) Encoding(levels(X[[i]])) <- "bytes"

write.csv(X, "test.csv", row.names = FALSE)
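
As a quick sanity check (not part of the answer above), you can inspect the first raw bytes of the file; a UTF-8 BOM would appear as ef bb bf at the very start:

readBin("test.csv", what = "raw", n = 16)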

How can I specify the encoding in fwrite() when exporting a CSV file in R?

You should post a reproducible example, but I would guess you can do this by making sure the data in DT is in UTF-8 within R, then setting the encoding of each column to "unknown". R will then assume the data is in the native encoding when you write it out.

For example,

DF <- data.frame(text = "á", stringsAsFactors = FALSE)
DF$text <- enc2utf8(DF$text)   # only necessary if Encoding(DF$text) isn't "UTF-8"
Encoding(DF$text) <- "unknown" # so the UTF-8 bytes are written out unchanged
data.table::fwrite(DF, "DF.csv", bom = TRUE)

If the columns of DF are factors, you'll need to convert them to character vectors before this will work.
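
A minimal sketch of that conversion (the vapply()/lapply() idiom here is my own, not code from the answer above):

# Replace every factor column with its character equivalent.
is_fac <- vapply(DF, is.factor, logical(1))
DF[is_fac] <- lapply(DF[is_fac], as.character)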

Loading a .csv file with UTF-8 encoding: error "no lines available in input" / byte order mark (BOM) "ï.."

You have the usual BOM problem. A BOM is properly used to indicate the byte order of UTF-16 and UTF-32 text (where byte order actually matters).

Microsoft apparently thinks that changing the interpretation of an existing standard is the way to go (who cares about interoperability with non-Microsoft systems?), so it uses the BOM to signal that a file is UTF-8, distinguishing it from the legacy encodings used by DOS and Windows. (Note: Linux and Apple switched their default encoding to UTF-8 without breaking things or adding a BOM, and much more quickly.)

So UTF-8 files created by Microsoft tools usually start with a BOM (the bytes 0xEF 0xBB 0xBF), which shows up as ï»¿ when read as cp1252 (the Microsoft extension to Latin-1).

But most tools not made by Microsoft misinterpret the BOM: some follow the standard and treat it as invisible white space (the original meaning of the code point, U+FEFF, before modern Unicode versions repurposed it as the BOM), while others just see it as stray binary data and mishandle the file.

For this reason there is now the encoding "UTF-8-BOM", which simply skips the initial BOM when reading (or writes one when creating a file). This usually fixes the issue.
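
In R, for example, you can hand this encoding to read.csv() via fileEncoding (a minimal sketch; the file name is illustrative):

# "UTF-8-BOM" strips the leading 0xEF 0xBB 0xBF on input, so the first
# column name is no longer mangled into "ï..".
df <- read.csv("test.csv", fileEncoding = "UTF-8-BOM",
               stringsAsFactors = FALSE)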

How to export a CSV in UTF-8 format?

Try opening a UTF-8 connection:

con <- file("filename", encoding = "UTF-8")
write.csv(..., file = con, ...)
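
This writes UTF-8 without a BOM. If you also want the BOM (for instance so that Excel detects the encoding), one option is to send U+FEFF through the connection before the CSV body; a minimal sketch, assuming a data frame X and the illustrative file name out.csv:

con <- file("out.csv", open = "w", encoding = "UTF-8")
writeLines("\ufeff", con, sep = "")  # U+FEFF is written as the bytes EF BB BF
write.csv(X, file = con, row.names = FALSE)
close(con)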

