How to Make the Separator in Pandas read_csv More Flexible with Respect to Whitespace, for Irregular Separators

How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
0 1 2 3 4
0 a b c 1 2
1 d e f 3 4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
0 1 2 3 4
0 a b c 1 2
1 d e f 3 4
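For reference, here's a rough Python 3 sketch of those same two calls (the filename is just the example file above; note that recent pandas releases deprecate delim_whitespace in favour of sep=r"\s+", so treat that option as version-dependent):

import pandas as pd

# "whitespace.csv" is the example file shown above.
df_regex = pd.read_csv("whitespace.csv", header=None, sep=r"\s+")

# Built-in alternative to the regex; deprecated in recent pandas
# in favour of sep=r"\s+".
df_ws = pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)

print(df_regex.equals(df_ws))  # expected: True for this file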

Can pandas handle variable-length whitespace as column delimiters

I think there's just a missing \ in the docs (maybe because it was interpreted as an escape marker at some point?). It's a regexp, after all:

In [68]: data = read_table('sample.txt', skiprows=3, header=None, sep=r"\s*")

In [69]: data
Out[69]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns:
X.1 7 non-null values
X.2 7 non-null values
X.3 7 non-null values
X.4 7 non-null values
X.5 7 non-null values
X.6 7 non-null values
[...]
X.23 7 non-null values
X.24 7 non-null values
X.25 5 non-null values
X.26 3 non-null values
dtypes: float64(8), int64(10), object(8)

Because of the delimiter problem noted by @MRAB, it has some trouble with the last few columns:

In [73]: data.iloc[:, 20:]
Out[73]:
X.21 X.22 X.23 X.24 X.25 X.26
0 315 0.95 ABC transporter transmembrane region
1 527 0.93 ABC transporter None None
2 408 0.86 RecF/RecN/SMC N terminal domain
3 575 0.85 RecF/RecN/SMC N terminal domain
4 556 0.72 AAA ATPase domain None
5 275 0.85 YceG-like family None None
6 200 0.85 Pyridine nucleotide-disulphide oxidoreductase None

but that can be patched up at the end.
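For instance, here's a rough sketch of that patch-up step, continuing from the data frame read above and assuming the trailing columns (X.23 onward in the output) are just fragments of one free-text description field; the column names come from the output shown, and the new "description" name is made up for this example:

# Columns holding the spilled-over description fragments (names as shown above).
text_cols = ["X.23", "X.24", "X.25", "X.26"]

# Re-join the fragments into a single column, skipping the NaN fillers
# left behind by the shorter rows, then drop the fragment columns.
data["description"] = data[text_cols].apply(
    lambda row: " ".join(str(v) for v in row.dropna()), axis=1
)
data = data.drop(columns=text_cols)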

How to make read_csv more flexible with numbers and whitespaces

I'd try it like this.

I'd manipulate the text file before attempting to parse it into a dataframe, as follows:

import pandas as pd
import re

# Read the whole file and flatten it to one line so it can be split in one go.
with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")

# Append an '@' marker to every number of the form 12,34 so it can be
# used as a delimiter afterwards.
prepared_text = re.sub(r'(\d+,\d+)', r'\1@', g)

df = pd.DataFrame({'My columns': prepared_text.split('@')})
print(df)

This gives the following:

                                          My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2

I guess this'd suffice as long as the input file wasn't too large, but using the re module and substitution gives you the control you seek.

The parentheses in (\d+,\d+) mark a group that we want to match; here it matches each of the numbers in your text file.
In the replacement, \1 is a backreference to that matched group, so every matched number is replaced by itself followed by @ (e.g. 20,00 becomes 20,00@).

Then we use the inserted character as a delimiter.
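As a quick standalone illustration of the group and backreference (the sample string below is made up):

import re

sample = "I am an example 20,00 and some more text 3,50"

# (\d+,\d+) captures each number; \1 writes the captured number back,
# followed by the '@' marker.
print(re.sub(r'(\d+,\d+)', r'\1@', sample))
# I am an example 20,00@ and some more text 3,50@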

There are some good examples here:

https://lzone.de/examples/Python%20re.sub

How to read file with space separated values in pandas

Add the delim_whitespace=True argument; it's faster than using a regex.
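A minimal sketch (the filename is a placeholder; note that newer pandas releases deprecate delim_whitespace in favour of sep=r"\s+"):

import pandas as pd

# "data.txt" stands in for your space-separated file.
df = pd.read_csv("data.txt", delim_whitespace=True)
print(df.head())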

Read Space-separated Data with Pandas

Your original line:

pd.read_csv(filename, sep=' ', header=None)

was specifying the separator as a single space. Because your CSVs can contain spaces or tabs, you can pass a regular expression to the sep parameter like so:

pd.read_csv(filename, sep=r'\s+', header=None)

This defines the separator as one or more whitespace characters; there is a handy cheatsheet that lists regular expressions.
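As a quick self-contained check (the small file written here is made up, mixing tabs and runs of spaces):

import pandas as pd

# Write a tiny file that mixes tabs and multiple spaces between fields.
with open("mixed.txt", "w") as f:
    f.write("a\tb   c\n")
    f.write("1  2\t3\n")

df = pd.read_csv("mixed.txt", sep=r"\s+", header=None)
print(df)  # both rows parse cleanly into three columns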


