Invalid Multibyte String in read.csv

Invalid multibyte string in read.csv

Encoding() sets the declared encoding of a character string. It doesn't set the encoding of the file named by that string, which is what you want here.

This worked for me, after trying "UTF-8":

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")

You may also want to skip the first 16 lines and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.

x <- read.csv(url, header = FALSE, stringsAsFactors = FALSE,
              fileEncoding = "latin1", skip = 16)
# get started with the clean-up
x[, 1] <- gsub("\u0081|`", "", x[, 1])  # get rid of odd characters
# strip thousands separators, then convert the remaining columns to numbers
x[-1] <- lapply(x[-1], function(d) {
  type.convert(gsub(",", "", d, fixed = TRUE), as.is = TRUE)
})
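
The difference between marking a string and re-encoding a file can be checked on a small example. The following is a minimal sketch, assuming a UTF-8 session; the temporary file and its contents are made up for illustration and are not the data from the question.

```r
# Build a tiny Latin-1 encoded CSV (hypothetical data):
# "Müller" with the ü stored as the single byte 0xFC.
tmp <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("name,value\nM"), as.raw(0xfc), charToRaw("ller,1\n"))
writeBin(bytes, tmp)

# fileEncoding tells R how the bytes in the file are encoded, so they are
# converted to the session's native encoding while reading.
x <- read.csv(tmp, stringsAsFactors = FALSE, fileEncoding = "latin1")
print(x$name)

unlink(tmp)
```

Reading the same file without fileEncoding in a UTF-8 session is the kind of situation in which the "invalid multibyte string" error tends to appear, because the lone 0xFC byte is not valid UTF-8.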

invalid multibyte string 8 error popping up for read.csv in R version 4.2.0

The default behaviour for R for versions < 4.2 has been:

If you don't set a default encoding, files will be opened using UTF-8
(on Mac desktop, Linux desktop, and server) or the system's default
encoding (on Windows).

This behaviour has changed in R 4.2:

R 4.2 for Windows will support UTF-8 as native encoding
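
To check which situation applies to a given session, base R can report its native encoding. This is a small sketch; the variable names are only for illustration.

```r
# l10n_info() describes the current locale; its "UTF-8" element is TRUE
# when the session's native encoding is UTF-8.
info <- l10n_info()
utf8_native <- isTRUE(info[["UTF-8"]])

cat("R version:", as.character(getRversion()), "\n")
cat("Native encoding is UTF-8:", utf8_native, "\n")
cat("LC_CTYPE locale:", Sys.getlocale("LC_CTYPE"), "\n")
```

If the native encoding is UTF-8 and the file was saved in a legacy Windows code page, the file's encoding has to be stated explicitly when reading it.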

To find out the default encoding on Windows 10, run the following PowerShell command:

[System.Text.Encoding]::Default

The output for this on my Windows 10 machine is:

IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252

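This can be passed to read.csv as the encoding of the file. Use the fileEncoding argument, which re-encodes the input while reading; the encoding argument only marks strings as Latin-1 or UTF-8 and does not convert anything:

read.csv(path_to_file, fileEncoding = "windows-1252")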

If you are unsure how to translate the output from PowerShell into the relevant string, you can search the list of all encodings with the stringi package:

# Replace "1252" with the relevant output from the PowerShell command
cat(grep("1252", stringi::stri_enc_list(simplify = FALSE), value = TRUE, ignore.case = TRUE))

You can take your pick from any of the options in the output:

# c("ibm-1252", "ibm-1252_P100-2000", "windows-1252") c("cp1252", "ibm-5348", "ibm-5348_P100-1997", "windows-1252")
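
As an alternative that needs no extra package, base R's iconvlist() can be searched the same way. This is a sketch under the assumption that the names R accepts for file encodings come from its iconv implementation, so the list is platform-dependent and may differ from what stringi reports.

```r
# Encoding names known to this R build's iconv (varies by platform).
known <- iconvlist()

# Replace "1252" with the code page reported on your machine.
matches <- grep("1252", known, value = TRUE, ignore.case = TRUE)
print(matches)
```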

R: invalid multibyte string

I realize this is pretty late, but I had a similar problem and I figured I'd post what worked for me. I used the iconv utility (e.g., "iconv file.pcl -f UTF-8 -t ISO-8859-1 -c"). The "-c" option skips characters that can't be translated.
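
A similar conversion can be done inside R with the base iconv() function. This is a sketch, not the poster's method: the sample string is made up, "latin1" is used as R's name for ISO-8859-1, and sub = "" is used here to drop bytes that cannot be converted, roughly what "-c" does on the command line.

```r
# A UTF-8 string containing a euro sign, which ISO-8859-1 cannot represent.
x <- "caf\u00e9 \u20ac5"

# Convert to Latin-1; sub = "" replaces non-convertible bytes with nothing
# instead of returning NA for the whole string.
y <- iconv(x, from = "UTF-8", to = "latin1", sub = "")
print(y)
```

Without the sub argument, iconv() returns NA for any element that contains a character it cannot convert.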

Read.table() invalid multibyte string error: Find the strings causing the error

The problem is in one of your column names which contains the ü character. Use check.names = FALSE in your read.csv2:

dat <- read.csv2("dat.csv", check.names = FALSE)

This will read your file correctly:

> head(dat)
          ISIN    WKN SecurityType            Bezeichnung Anlageuniversum (Gruppe)
1 AN8068571086 853390        Stock           SCHLUMBERGER            Aktien Europa
2 AT000000STR1 A0M23V        Stock                STRABAG            Aktien Europa
3 AT00000AMAG3 A1JFYU        Stock AMAG AUSTRIA METALL AG            Aktien Europa
4 AT00000ATEC9 A0LFDH        Stock       A-TEC INDUSTRIES            Aktien Europa
5 AT00000BENE6 A0LCPZ        Stock                BENE AG            Aktien Europa
6 AT00000FACC2 A1147K        Stock                FACC AG            Aktien Europa
       Anlageuniversum Whitelist f\x81r institutionelle Produkte _ Schweiz
1 Aktien Europa Select                                                   X
2 Aktien Europa Select                                                   X
3 Aktien Europa Select                                                   X
4 Aktien Europa Select                                                   X
5 Aktien Europa Select                                                   X
6 Aktien Europa Select                                                   X

You can then rename the columns, for example:

names(dat) <- c("ISIN","WKN","SecurityType","Bezeichnung",
"Anlageuniversum_Gruppe","Anlageuniversum","Whitelist_Schweiz")

Another possibility is reading your file without the headers:

dat <- read.csv2("dat.csv", header = FALSE, skip = 1)
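
To locate the strings that trigger the error, one option is to check each line of the file for byte sequences that are not valid UTF-8. The following is a sketch, assuming the session expects UTF-8 input; the sample file is hypothetical and only mimics a header containing a Latin-1 ü.

```r
# Hypothetical sample: a header with ü stored as the Latin-1 byte 0xFC.
tmp <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("id;Whitelist f"), as.raw(0xfc),
           charToRaw("r Schweiz\n1;X\n"))
writeBin(bytes, tmp)

# Read the raw lines and flag those that are not valid UTF-8.
lines <- readLines(tmp, warn = FALSE)
bad <- which(!validUTF8(lines))
print(bad)

# Show the offending lines with non-ASCII bytes written as <xx> hex codes.
print(iconv(lines[bad], from = "latin1", to = "ASCII", sub = "byte"))

unlink(tmp)
```

The reported line numbers point at the rows (or the header) that need a different fileEncoding or manual cleaning.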

Error while reading csv file in R

You need to specify the correct delimiter in the sep argument.
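
As an illustration of the delimiter point, here is a minimal sketch using a made-up semicolon-separated file. With the default comma separator, each line ends up in a single column.

```r
# Hypothetical semicolon-separated file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "1;2;3"), tmp)

wrong <- read.csv(tmp)             # default sep = "," gives one column
right <- read.csv(tmp, sep = ";")  # columns a, b and c

print(ncol(wrong))
print(ncol(right))

unlink(tmp)
```

For files that use a semicolon separator and a decimal comma, read.csv2 already has these as its defaults.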


