Avoid spaces in column names being replaced with periods (".") when using read.csv()

If you set check.names=FALSE in read.csv when you read the data in, the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (back-quotes in some cases) or refer to the columns by position rather than name while editing.
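
For example, a minimal sketch (the file and column names here are invented purely for illustration):

dat <- read.csv("measurements.csv", check.names = FALSE)

# the header is kept verbatim, so quote non-syntactic names with back-quotes
dat$`Sensor Reading`

# or refer to the column by position, or by its literal name as a string
dat[[2]]
dat[["Sensor Reading"]]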

R: Import CSV with column names that contain spaces

Unless you specify check.names=FALSE, R will convert column names that are not valid variable names (e.g. names that contain spaces or special characters, or that start with numbers) into valid variable names, e.g. by replacing spaces with dots. Try names(s_data) to see what the names actually are. If you do use check.names=FALSE, surround the names with back-quotes (`) when you refer to them.

I would also recommend renaming the columns, e.g. with dplyr::rename (the older reshape/plyr rename works too, but expects a named vector of c(old = "new") pairs). The example below uses dplyr's new_name = "old_name" syntax.

s_data <- read.csv2(file = f_name)
library(dplyr)
s_df <- rename(s_data, ID = "scada_id",
               PlantNo = "plant", DateTime = "date", Main.status = "main_code",
               Additional.status = "seco_code", MainStatustext = "main_text",
               AddStatustext = "seco_test", Duration = "duration")

For what it's worth, the tidyverse tools (e.g. readr::read_csv) have the opposite default: they don't transform the column names to make them legal R symbols unless you explicitly request it.
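
For instance, a small sketch (the file name is again invented; name_repair requires a reasonably recent readr):

library(readr)

# the raw header, spaces and all, is kept by default
dat <- read_csv("measurements.csv")

# opt in to syntactic names explicitly via name repair
dat2 <- read_csv("measurements.csv", name_repair = "universal")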

R read.csv "More columns than column names" error

That's one wonky CSV file, with multiple headers tossed about (try pasting it into CSV Fingerprint to see what I mean).

Since I don't know the data, it's impossible to be sure the following produces accurate results for you, but it involves using readLines and other R functions to pre-process the text:

# use readLines to get the data
dat <- readLines("N0_07312014.CSV")

# I had to do this to fix grep errors caused by the file's encoding/locale
Sys.setlocale('LC_ALL','C')

# filter out the repeating, wonky headers
dat_2 <- grep("Node Name,RTC_date", dat, invert=TRUE, value=TRUE)

# turn that vector into a text connection for read.csv
dat_3 <- read.csv(textConnection(paste0(dat_2, collapse="\n")),
                  header=FALSE, stringsAsFactors=FALSE)

str(dat_3)
## 'data.frame': 308 obs. of 37 variables:
## $ V1 : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ V2 : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ V3 : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ V4 : chr "" "" "" "" ...
## .. more
## $ V36: chr "" "" "" "" ...
## $ V37: chr "0" "0" "0" "0" ...

# grab the headers
headers <- strsplit(dat[1], ",")[[1]]

# how many of them are there?
length(headers)
## [1] 32

# limit it to the 32 columns you want (which matches the number of headers)
dat_4 <- dat_3[,1:32]

# and add the headers
colnames(dat_4) <- headers

str(dat_4)
## 'data.frame': 308 obs. of 32 variables:
## $ Node Name : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ RTC_date : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ RTC_time : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ N1 Bat (VDC) : chr "" "" "" "" ...
## $ N1 Shinyei (ug/m3): chr "" "" "0.23" "null" ...
## $ N1 CC (ppb) : chr "" "" "null" "null" ...
## $ N1 Aeroq (ppm) : chr "" "" "null" "null" ...
## ... continues

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

This is expected behavior from read.csv, not a truncation problem in R. When column names in a file contain spaces or special characters, read.csv replaces each of them with a period (.) unless you specify check.names = FALSE.

Here's a glimpse at make.names, which is how read.table produces the column names.

nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."

And here's the relevant line from read.table:

if (check.names)
    col.names <- make.names(col.names, unique = TRUE)

PySpark: remove special characters in all column names

You can substitute every character that is not a letter, a digit, or $ (here, with an empty string):

import pyspark.sql.functions as F
import re

df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+","",col)) for col in df.columns])

Pandas replace a character in all column names

Use str.replace with a regular expression (pass regex=True so that [()] is treated as a character class rather than a literal string):

df.columns = df.columns.str.replace("[()]", "_", regex=True)

Sample:

df = pd.DataFrame({'(A)':[1,2,3],
                   '(B)':[4,5,6],
                   'C)':[7,8,9]})

print (df)
   (A)  (B)  C)
0    1    4   7
1    2    5   8
2    3    6   9

df.columns = df.columns.str.replace(r"[()]", "_", regex=True)
print (df)
   _A_  _B_  C_
0    1    4   7
1    2    5   8
2    3    6   9

Pandas read_csv: low_memory and dtype options

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id.
It contains 10 million rows where the user_id is always numbers.
Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.

Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy concept; read more about them here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exist?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

'datetime64[ns, <tz>]', which is a time zone aware timestamp.

'category', which is essentially an enum (strings represented by integer keys to save space).

'period[]', not to be confused with a timedelta; these objects are actually anchored to specific time periods.

'Sparse', 'Sparse[int]' and 'Sparse[float]' are for sparse data, or 'data that has a lot of holes in it'. Instead of saving the NaN or None values in the dataframe, the sparse dtypes omit them, saving space.

'Interval' is a topic of its own, but its main use is for indexing.

'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas-specific integers that are nullable, unlike the numpy variants.

'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.

'boolean' is like the numpy 'bool' but it also supports missing data.

Read the complete reference here:

Pandas dtype reference
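
As a quick, non-authoritative sketch of a few of these extension dtypes in action (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    # nullable integer: keeps the missing value without falling back to float
    "user_id": pd.array([1, 2, None], dtype="Int64"),
    # enum-like: strings are stored once and referenced by integer codes
    "plan": pd.Series(["free", "pro", "free"], dtype="category"),
    # dedicated string dtype; the .str accessor is available
    "name": pd.Series(["Alice", "Bob", None], dtype="string"),
})
print(df.dtypes)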

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but it will not make reading more memory efficient; if anything, it is only more processing efficient.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
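
A small sketch of the dtype=object case (the tiny CSV here mirrors the earlier example):

import pandas as pd
from io import StringIO

csvdata = "user_id,username\n1,Alice\n3,Bob\nfoobar,Caesar"

# every column is read as Python objects (strings); the warning goes away, but no memory is saved
df = pd.read_csv(StringIO(csvdata), dtype=object)
print(df.dtypes)   # both columns report dtype 'object'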

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.
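
A hedged sketch of what such a converter might look like (the fallback value -1 is an arbitrary choice for illustration):

import pandas as pd
from io import StringIO

csvdata = "user_id,username\n1,Alice\n3,Bob\nfoobar,Caesar"

def to_int(value):
    # parse valid integers; map anything unparseable (like 'foobar') to -1
    try:
        return int(value)
    except ValueError:
        return -1

df = pd.read_csv(StringIO(csvdata), converters={"user_id": to_int})
print(df)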

Since CSV files can be processed line by line, they could in principle be handled more efficiently by cutting the file into segments and running several converter processes in parallel, but pandas does not support that. This, however, is a different story.

How to read a CSV file properly if each row contains a different number of fields (and the number of fields can be quite big)?

As suggested, DictReader could also be used as follows to create a list of rows. This could then be imported as a frame in pandas:

import pandas as pd
import csv

rows = []
csv_header = ['user', 'item', 'time', 'rating', 'review']
frame_header = ['user', 'item', 'rating', 'review']

with open('input.csv', 'rb') as f_input:
    for row in csv.DictReader(f_input, delimiter=' ', fieldnames=csv_header[:-1],
                              restkey=csv_header[-1], skipinitialspace=True):
        try:
            rows.append([row['user'], row['item'], row['rating'], ' '.join(row['review'])])
        except KeyError, e:
            rows.append([row['user'], row['item'], row['rating'], ' '])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display the following:

         user      item  rating                                   review
0  disjiad123  TYh23hs9       5  I love this phone as it is easy to use
1  hjf2329ccc  TGjsk123       3                          Suck restaurant

If the review appears at the start of the row, then one approach would be to parse the line in reverse as follows:

import pandas as pd
import csv

rows = []
frame_header = ['rating', 'time', 'item', 'user', 'review']

with open('input.csv', 'rb') as f_input:
    for row in f_input:
        cols = [col[::-1] for col in row[::-1][2:].split(' ') if len(col)]
        rows.append(cols[:4] + [' '.join(cols[4:][::-1])])

frame = pd.DataFrame(rows, columns=frame_header)
print frame

This would display:

  rating      time      item        user  \
0      5  13160032  TYh23hs9   isjiad123
1      3  14423321  TGjsk123  hjf2329ccc

                                     review
0  I love this phone as it is easy to used
1                           Suck restaurant

row[::-1] reverses the text of the whole line, and [2:] skips over the line ending, which now sits at the start of the reversed line. Each line is then split on spaces, and a list comprehension re-reverses each split entry. Finally, each row appended to rows is built by taking the first four fixed column entries (now at the start of the list); the remaining entries are joined back together with a space and added as the final review column.

The benefit of this approach is that it does not rely on your input data being in an exactly fixed-width format, and you don't have to worry about the column widths changing over time.


