Invalid multibyte string in read.csv
Encoding() sets the declared encoding of a character string. It doesn't set the encoding of the file that the string points to, which is what you want.
This worked for me, after first trying "UTF-8":
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1]) # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1], # strip thousands separators, then convert to numbers
function(d) type.convert(gsub(",", "", d, fixed=TRUE), as.is=TRUE)))
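The effect of fileEncoding can be tried on a small file without touching the original data. The following is a minimal sketch, assuming a session whose native encoding can represent "ü" (for example a UTF-8 locale); the file contents, the column names city and population, and the temporary path are made up for illustration.

```r
# Build a tiny Latin-1 file by hand (hypothetical sample data).
# The byte 0xFC is "ü" in Latin-1 but is not valid UTF-8 on its own.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("city,population\nM"), as.raw(0xFC),
           charToRaw("nchen,\"1,488,202\"\n")), con)
close(con)

# fileEncoding tells R how the bytes in the file are encoded,
# so they are re-encoded to the native encoding while reading.
x <- read.csv(tmp, stringsAsFactors = FALSE, fileEncoding = "latin1")

# Same clean-up idea as above: drop thousands separators, then convert.
x$population <- as.numeric(gsub(",", "", x$population, fixed = TRUE))
str(x)
```

If the same file is read without fileEncoding in a UTF-8 session, the stray byte is the kind of input that can trigger the "invalid multibyte string" error.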
invalid multibyte string 8 error popping up for read.csv in R version 4.2.0
The default behaviour for R for versions < 4.2 has been:
If you don't set a default encoding, files will be opened using UTF-8
(on Mac desktop, Linux desktop, and server) or the system's default
encoding (on Windows).
This behaviour has changed in R 4.2:
R 4.2 for Windows will support UTF-8 as native encoding
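To check which situation applies to a given session, R can report its own locale and whether it is running with UTF-8 as the native encoding. This is a small sketch using base R only; the variable names are arbitrary.

```r
# Report the R version and the character-type locale of this session.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

cat("R version:      ", R.version.string, "\n")
cat("LC_CTYPE locale:", Sys.getlocale("LC_CTYPE"), "\n")
cat("Native UTF-8:   ", is_utf8, "\n")
```

If the last line prints TRUE, files that are not UTF-8 need an explicit encoding when they are read.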
To find out the default encoding on Windows 10, run the following PowerShell command:
[System.Text.Encoding]::Default
The output for this on my Windows 10 machine is:
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
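This can be passed to read.csv as the file encoding to use. Note that fileEncoding re-encodes the input, whereas the encoding argument only marks strings as Latin-1 or UTF-8:
read.csv(path_to_file, fileEncoding = "windows-1252")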
If you are unsure how to translate the output from PowerShell into the relevant string, you can search the list of all encodings with the stringi package:
# Replace "1252" with the relevant output from the Powershell command
cat(grep("1252", stringi::stri_enc_list(simplify = FALSE), value = TRUE, ignore.case = TRUE))
You can take your pick from any of the options in the output:
# c("ibm-1252", "ibm-1252_P100-2000", "windows-1252") c("cp1252", "ibm-5348", "ibm-5348_P100-1997", "windows-1252")
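If stringi is not installed, a similar search can be done with base R. This is a sketch under the assumption that the names of interest are the ones known to R's own iconv, which vary by platform, so the output may differ from the stringi list above.

```r
# List the encoding names known to this platform's iconv
# and keep those that mention the code page number.
candidates <- grep("1252", iconvlist(), value = TRUE, ignore.case = TRUE)
print(candidates)
```

An empty result only means this platform's iconv does not list a name containing "1252"; it does not say anything about the file itself.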
R: invalid multibyte string
I realize this is pretty late, but I had a similar problem and I figured I'd post what worked for me. I used the iconv utility (e.g., "iconv file.pcl -f UTF-8 -t ISO-8859-1 -c"). The "-c" option skips characters that can't be translated.
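A roughly comparable conversion can be done inside R with the base iconv() function. This is only a sketch: the sample string is made up, and using sub = "" to drop non-convertible input is an approximation of the "-c" flag rather than an exact equivalent. How unconvertible characters are handled can also differ between platforms.

```r
# A UTF-8 string containing a character ("€") that ISO-8859-1 cannot represent.
x <- "caf\u00e9 \u20ac5"

# By default iconv() returns NA when something cannot be converted;
# sub = "" replaces the non-convertible bytes with nothing instead.
y <- iconv(x, from = "UTF-8", to = "latin1", sub = "")
print(y)
```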
Read.table() invalid multibyte string error: Find the strings causing the error
The problem is in one of your column names, which contains the ü character. Use check.names = FALSE in your read.csv2 call:
dat <- read.csv2("dat.csv", check.names = FALSE)
This will read your file correctly:
> head(dat)
ISIN WKN SecurityType Bezeichnung Anlageuniversum (Gruppe) Anlageuniversum Whitelist f\x81r institutionelle Produkte _ Schweiz
1 AN8068571086 853390 Stock SCHLUMBERGER Aktien Europa Aktien Europa Select X
2 AT000000STR1 A0M23V Stock STRABAG Aktien Europa Aktien Europa Select X
3 AT00000AMAG3 A1JFYU Stock AMAG AUSTRIA METALL AG Aktien Europa Aktien Europa Select X
4 AT00000ATEC9 A0LFDH Stock A-TEC INDUSTRIES Aktien Europa Aktien Europa Select X
5 AT00000BENE6 A0LCPZ Stock BENE AG Aktien Europa Aktien Europa Select X
6 AT00000FACC2 A1147K Stock FACC AG Aktien Europa Aktien Europa Select X
Then you can rename the columns, for example:
names(dat) <- c("ISIN","WKN","SecurityType","Bezeichnung",
"Anlageuniversum_Gruppe","Anlageuniversum","Whitelist_Schweiz")
Another possibility is reading your file without the headers:
dat <- read.csv2("dat.csv", header = FALSE, skip = 1)
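To find the strings causing the error without guessing, the raw lines can be checked one by one. The following is a minimal sketch, assuming the file is expected to be UTF-8; the temporary file and its contents are hypothetical stand-ins for dat.csv.

```r
# Hypothetical sample file: a semicolon-separated header
# containing the stray byte 0x81, followed by one data row.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("ISIN;Whitelist f"), as.raw(0x81),
           charToRaw("r\nAN8068571086;X\n")), con)
close(con)

# Read the lines as-is and flag those that are not valid UTF-8.
lines <- readLines(tmp, warn = FALSE)
bad <- which(!validUTF8(lines))
print(bad)

# Show the offending lines with the problem bytes escaped for inspection.
print(iconv(lines[bad], from = "latin1", to = "ASCII", sub = "byte"))
```

The indices in bad are line numbers in the file, so a hit on line 1 points at the header, as in the case described above.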
Error while reading csv file in R
You need to specify the correct delimiter in the sep argument.
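A small sketch of what goes wrong with the default delimiter, using made-up inline data instead of a file:

```r
# Semicolon-separated sample data (hypothetical).
txt <- "a;b\n1;2\n3;4"

# With the default sep = ",", each line is read as a single field.
wrong <- read.csv(text = txt)

# With the matching delimiter, the two columns are split as intended.
right <- read.csv(text = txt, sep = ";")

print(ncol(wrong))
print(ncol(right))
```

For semicolon-separated files that also use a decimal comma, read.csv2 sets both defaults at once.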