How can I declare a thousand separator in read.csv?
Since the question carries the "r" tag, I assume this is an R question.
In R, you do not need to do anything special to handle the quoted commas:
> read.csv('t.csv', header=F)
V1 V2 V3 V4
1 Sudan 15,276,000 14,098,000 13,509,000
2 Chad 209000 196000 190000
# if you want to convert them to numbers:
> df <- read.csv('t.csv', header=FALSE, stringsAsFactors=FALSE)
> df$V2 <- as.numeric(gsub(',', '', df$V2))
Read csv with numeric columns containing thousands separator
You should be able to read the data with read.csv. Here is an example:
#write data
write('Date,x,y\n"2015/08/01","71,131","20,390"\n"2015/08/02","81,599","23,273"\n"2015/08/03","79,435","21,654"\n"2015/08/04","80,733","20,924"',"test.csv")
#use "text" rather than "file" in read.csv
#perform regex substitution before using read.csv
#the outer gsub with '(?<=\\d),(\\d{3})(?!\\d)' performs the thousands separator substitution
#the inner gsub replaces all \" with '
read.csv(text=gsub('(?<=\\d),(\\d{3})(?!\\d)',
'\\1',
gsub("\\\"",
"'",
paste0(readLines("test.csv"),collapse="\n")),
perl=TRUE),
header=TRUE,
quote="'",
stringsAsFactors=FALSE)
The result
# Date x y
#1 2015/08/01 71131 20390
#2 2015/08/02 81599 23273
#3 2015/08/03 79435 21654
#4 2015/08/04 80733 20924
pandas reading CSV data formatted with comma for thousands separator
Pass the parameter thousands=',' to read_csv so that those values are read as numbers with a thousands separator:
In [27]:
import pandas as pd
import io
t="""id;value
0;123,123
1;221,323,330
2;32,001"""
pd.read_csv(io.StringIO(t), thousands=r',', sep=';')
Out[27]:
id value
0 0 123123
1 1 221323330
2 2 32001
Conflict between thousand separator and date format - pandas.read_csv
I have managed to solve the problem.
from datetime import datetime
df = pd.read_csv(filepath, sep=";", header=5, decimal=",", thousands = ".", parse_dates=['Datum'], date_parser = lambda x: datetime.strptime(x, '%d.%m.%Y'))
df['Datum'] = df['Datum'].dt.strftime("%d.%m.%Y")
The problem was that the thousands separator was ".", which also appears in the dates. Parsing the date column explicitly and then formatting it the way I wanted afterwards made everything work.
Appreciate all help!
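A minimal sketch of another way to avoid the clash, assuming the date column can simply be read as text first: declaring it as str keeps the "." characters in the dates untouched, and pd.to_datetime converts it afterwards. The sample data and the Betrag column name are made up for illustration; only Datum comes from the answer above.

```python
import io
import pandas as pd

# Hypothetical sample: "." is used in the dates and as the thousands separator.
raw = "Datum;Betrag\n01.02.2020;1.234,56\n15.03.2020;2.000,00\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    decimal=",",
    thousands=".",
    dtype={"Datum": str},  # read dates as plain text, no numeric parsing
)

# Convert the text column to real datetimes with an explicit format.
df["Datum"] = pd.to_datetime(df["Datum"], format="%d.%m.%Y")
print(df.dtypes)
```

An explicit format avoids any day/month guessing, and the numeric column is still parsed with the "." thousands separator.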
Use Non breakable space as thousands separator in pandas read_csv function
pd.read_csv supports two parser engines: C and Python. According to the doc,
The C engine is faster while the python engine is currently more feature-complete.
I did some tests, and it looks like the C engine (the default choice in most cases) can only deal with thousands and decimal separators that are basic ASCII characters ('\x00' - '\x7f'); using '\xa0' as the thousands separator is only supported in the Python engine.
import io
import pandas as pd
data = "0,11;1\xa0279,92;1\xa0324,21;1\xa0302,14;10,65;2\xa0707,77;2\xa0951,71;2\xa0829,40"
df = pd.read_csv(io.StringIO(data), header=None, encoding="iso-8859-1",
sep=';', decimal=',', thousands='\xa0', engine="python")
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 1 non-null float64
1 1 1 non-null float64
2 2 1 non-null float64
3 3 1 non-null float64
4 4 1 non-null float64
5 5 1 non-null float64
6 6 1 non-null float64
7 7 1 non-null float64
dtypes: float64(8)
memory usage: 192.0 bytes
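If the Python engine is too slow, one possible workaround is to strip the non-breaking spaces from the text before parsing, so the default engine never sees them. This is only a sketch, and it assumes that '\xa0' does not occur inside any text column that must be preserved; the data is a shortened version of the sample above.

```python
import io
import pandas as pd

data = "0,11;1\xa0279,92;1\xa0324,21;10,65"

# Remove the non-breaking spaces up front; no thousands argument is needed then.
cleaned = data.replace("\xa0", "")

df = pd.read_csv(io.StringIO(cleaned), header=None, sep=";", decimal=",")
print(df.dtypes)
```

For a file on disk, the same replacement would have to be applied to the file contents after reading them with the right encoding.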
Most elegant way to load csv with point as thousands separator in R
Adapted from this post: Specify custom Date format for colClasses argument in read.table/read.csv
#some sample data
write.csv(data.frame(a=c("1.234,56", "1.234,56"),
b=c("1.234,56", "1.234,56")),
"test.csv", row.names=FALSE, quote=TRUE)
#define your own numeric class
setClass('myNum')
#define conversion
setAs("character", "myNum",
function(from) as.numeric(gsub(",", "\\.", gsub("\\.", "", from))))
#read data with custom colClasses
read_data = read.csv("test.csv",
stringsAsFactors=FALSE,
colClasses=c("myNum", "myNum"))
#let's try whether this is really a numeric
read_data[1, 1] * 2
#[1] 2469.12
Pandas: Read csv with quoted values, comma as decimal separator, and period as digit grouping symbol
What about this?
import pandas
table = pandas.read_csv("data.csv", sep=";", decimal=",")
print(table["Amount"][0]) # -36.37
print(type(table["Amount"][0])) # <class 'numpy.float64'>
print(table["Amount"][0] + 36.37) # 0.0
Pandas automatically detects a number and converts it to numpy.float64.
Edit:
As @bweber discovered, some values in data.csv contained more than 3 digits and used the digit grouping symbol '.'. In order to convert those strings to numbers, the grouping symbol must be passed to the read_csv() method via the thousands parameter:
table = pandas.read_csv("data.csv", sep=";", decimal=",", thousands='.')
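Here is a small self-contained check of that call, using made-up inline data instead of data.csv (the Name column and the values are assumptions for illustration). It suggests that quoted fields are still converted to numbers once both separators are declared.

```python
import io
import pandas

# Hypothetical stand-in for data.csv: quoted values, "," decimal, "." grouping.
raw = 'Name;Amount\n"rent";"-1.200,00"\n"coffee";"-36,37"\n"salary";"12.345,67"\n'

table = pandas.read_csv(io.StringIO(raw), sep=";", decimal=",", thousands=".")
print(table["Amount"].dtype)
print(table["Amount"].sum())
```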
How to read .csv-data containing thousand separators and special handling of zeros (in R)?
If "," is the only separator, i.e. all of the numbers are integers, you can set the dec argument of read.csv2 (or read.csv) to "," and multiply by 1000:
data <- read.csv2(
text = "id ; variable1
1 ; 2,001
1,008 ; 2,001
1,009 ; 2,002
1,01 ; 2,001
1,3 ; 2,0",
sep = ";",
stringsAsFactors = FALSE,
header = TRUE,
dec = "," )
> 1000*data
id variable1
1 1000 2001
2 1008 2001
3 1009 2002
4 1010 2001
5 1300 2000
>
Read CSV file with space as thousand-separator using pandas.read_csv
If you have non-breaking spaces, I would suggest a more aggressive regular expression with str.replace:
df.col1 = df.col1.str.replace(r'[^\d.,e+-]', '', regex=True)\
    .str.replace(',', '.', regex=False).astype(float)
Regex
[ # character group
^ # negation - match every character NOT listed in this group
\d # digit
. # dot
, # comma
e # 'e' - exponent
+- # signs
]
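The following sketch applies that pattern to a few made-up values (the sample strings are assumptions, not data from the question). It passes regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical values using a non-breaking space, a plain space and a
# narrow no-break space as the thousands separator.
col = pd.Series(["1\xa0279,92", "10,65", "2 707,77", "-3\u202f000,5"])

cleaned = (
    col.str.replace(r"[^\d.,e+-]", "", regex=True)  # drop everything but number characters
       .str.replace(",", ".", regex=False)          # decimal comma -> decimal point
       .astype(float)
)
print(cleaned.tolist())
```

Note that this only works when "," is the decimal mark and never a thousands separator, because every comma becomes a decimal point.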
How to read pandas CSV file with comma separator and comma thousand separator
Try reading it with:
pd.read_csv(myfile, encoding='latin1', quotechar='"')
Each column that contains these will be treated as type object.
Once you get this, to get back to float use:
df = df.apply(lambda x: pd.to_numeric(x.astype(str).str.replace(',',''), errors='coerce'))
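As a rough illustration of that clean-up, here is the same conversion on made-up inline data (the column names a and b are assumptions). It also shows thousands=',' for comparison, which may be enough on its own when the fields containing commas are quoted.

```python
import io
import pandas as pd

# Hypothetical CSV: comma as field separator and as thousands separator,
# with the affected fields quoted.
raw = 'a,b\n"1,234","5,678.5"\n"2,000","12"\n'

# Step 1: read as-is; the quoted fields arrive as strings.
df = pd.read_csv(io.StringIO(raw), quotechar='"')

# Step 2: strip the commas and convert each column to numbers.
df_num = df.apply(
    lambda x: pd.to_numeric(x.astype(str).str.replace(",", ""), errors="coerce")
)

# Alternative: let read_csv handle the thousands separator directly.
direct = pd.read_csv(io.StringIO(raw), quotechar='"', thousands=",")
print(df_num)
print(direct)
```

With errors='coerce', any value that still cannot be parsed becomes NaN rather than raising an error, so it is worth checking for unexpected NaN values afterwards.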
Alternatively you can try:
pd.read_csv(myfile, encoding='latin1', quotechar='"', on_bad_lines='warn')  # pandas >= 1.3; older versions: error_bad_lines=False
Here you can see what was omitted from the original CSV, i.e. what caused the problem.
For each line that was omitted you'll receive a warning instead of an error.