Reading Rdata File with Different Encoding

Reading Rdata file with different encoding

Thanks to 42's comment, I've managed to write a function to recode the file:

fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  # Declare the encoding of the strings in every column of the data frame
  for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
  return(df)
}

The meat here is the command Encoding(df[, col]) <- "latin1", which declares that column col of data frame df is encoded as latin1. Unfortunately, Encoding works on (character) vectors rather than on whole data frames, so I had to write a function that sweeps all columns of a data frame and applies the transformation.

Of course, if your problem is in just a couple of columns, you're better off applying Encoding to those columns instead of the whole data frame (you can modify the function above to take a set of columns as input), as sketched below. Also, if you're facing the inverse problem, i.e. reading into Windows an R object created on Linux or macOS, you should use originalEncoding = "UTF-8".
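For example, a minimal sketch of both cases (the column names text and title are just placeholders for whichever character columns need fixing):

# Re-declare only selected character columns as latin1 (example column names)
for (col in c("text", "title")) Encoding(df[[col]]) <- "latin1"

# Inverse case: object created on Linux/macOS, read on Windows
df <- fix.encoding(df, originalEncoding = "UTF-8")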

Change encoding type when parsing RData file into Python using Rdata package

I am the author of the rdata package.

The convert function accepts the keyword parameter default_encoding, which you can use to specify the encoding to assume when it is not explicitly declared in the string.

You can also use force_default_encoding if the encoding is explicitly declared but wrong.

Your code would be then:

import pandas as pd
import rdata

# Parse the .rda file, assuming UTF-8 for strings with no declared encoding
parsed = rdata.parser.parse_file("news_dataset.rda")
converted = rdata.conversion.convert(parsed, default_encoding="utf8")
converted_df = pd.DataFrame(converted.get("df_final"))

If you have further doubts about the package, feel free to open a discussion in the GitHub repo. I am notified of those and can usually answer the same day.

Read in file with UTF-8 character in path in R

At first I thought your locale was the problem; windows-1252 doesn't contain "Ń". But I couldn't reproduce your error, even with file names like "Ń.rds" in latin1 encoding and a German locale.

But the amount of whitespace in your error was more than I got for files that didn't exist... Then I spotted the leading space in your example output:

[1] " G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe /POLJAŃSKI_Paweł_sprinter_point.rds"

That could explain why it prints "okay" (we don't see whitespace), but trying to read would fail. It does leave me puzzled about why your other files read without problem.
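If you want to check for that, a sketch along these lines (path stands in for however you build the file name) would make stray spaces visible and strip them before reading:

# 'path' stands in for the constructed file name (illustrative variable name)
print(sQuote(path))                        # quoting makes stray spaces visible
path <- gsub("\\s+/", "/", trimws(path))   # drop stray spaces around components
readRDS(path)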

If that isn't the problem, then it may be the relatively recent support for UTF-8 in Windows. Historically, Windows has used UCS-2 and UTF-16 internally. "Turning on" UTF-8 support requires a different C runtime. There is an experimental build of R that uses that runtime, which you could try out. But that requires you to rebuild your libraries (readr!) with that runtime too.

Before messing up your whole R installation, I'd test whether the experimental build can read a file called Ń.csv.
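A quick way to run that test (just a sketch; the file contents are arbitrary):

# Create a tiny file whose name contains the problematic character,
# then check whether this R build can read it back
writeLines("a,b\n1,2", "Ń.csv")
read.csv("Ń.csv")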

How to read data from a CSV file of two possible encodings?

Once you know which encoding your file has, you can pass it inside the CSV options, e.g.

external_encoding: Encoding::ISO_8859_15, 
internal_encoding: Encoding::UTF_8

(This declares that the file is ISO-8859-15, but that you want the strings internally as UTF-8.)

So the strategy is to decide first (before opening the file) which encoding you want, and then use the appropriate options Hash.

How to change encoding (using latin encoding) in pyreadr Python

Unfortunately it is currently not possible. From the README in the known limitations section:

Cannot read RData or rds files in encodings other than utf-8.

This feature would need to be supported by the underlying C library, librdata. You can open an issue in librdata suggesting that this feature be implemented, together with a sample file.

Loading .RData files into Python

People ask this sort of thing on the R-help and R-devel lists, and the usual answer is that the code is the documentation for the .RData file format. So any other implementation in any other language is hard++.

I think the only reasonable way is to install RPy2 and use R's load function from it, converting to appropriate Python objects as you go. The .RData file can contain structured objects as well as plain tables, so watch out.

Linky: http://rpy.sourceforge.net/rpy2/doc-2.4/html/

Quicky:

>>> import rpy2.robjects as robjects
>>> robjects.r['load'](".RData")

The objects are now loaded into the R workspace.

>>> robjects.r['y']
<FloatVector - Python:0x24c6560 / R:0xf1f0e0>
[0.763684, 0.086314, 0.617097, ..., 0.443631, 0.281865, 0.839317]

That's a simple vector; d is a data frame, and I can subset it to get columns:

>>> robjects.r['d'][0]
<IntVector - Python:0x24c9248 / R:0xbbc6c0>
[ 1, 2, 3, ..., 8, 9, 10]
>>> robjects.r['d'][1]
<FloatVector - Python:0x24c93b0 / R:0xf1f230>
[0.975648, 0.597036, 0.254840, ..., 0.891975, 0.824879, 0.870136]

Library data table - fread read multiple csv with encoding = UTF-8

You can use -

library(data.table)
cirium <- rbindlist(lapply(filenames, fread, encoding = "UTF-8"))

Or, to be more explicit, you can use an anonymous function:

cirium <- rbindlist(lapply(filenames, function(x) fread(x, encoding = "UTF-8")))

