How to Make the Separator in Pandas read_csv More Flexible with Respect to Whitespace, for Irregular Separators

How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
0 1 2 3 4
0 a b c 1 2
1 d e f 3 4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
0 1 2 3 4
0 a b c 1 2
1 d e f 3 4
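For reference, here's a rough Python 3 sketch of those same two calls (the filename is just the example file above; note that recent pandas releases deprecate delim_whitespace in favour of sep=r"\s+", so treat that option as version-dependent):

import pandas as pd

# "whitespace.csv" is the example file shown above.
df_regex = pd.read_csv("whitespace.csv", header=None, sep=r"\s+")

# Built-in alternative to the regex; deprecated in recent pandas
# in favour of sep=r"\s+".
df_ws = pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)

print(df_regex.equals(df_ws))  # expected: True for this file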

Can pandas handle variable-length whitespace as column delimiters

I think there's just a missing \ in the docs (maybe because it was interpreted as an escape marker at some point?). It's a regexp, after all:

In [68]: data = read_table('sample.txt', skiprows=3, header=None, sep=r"\s*")

In [69]: data
Out[69]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns:
X.1 7 non-null values
X.2 7 non-null values
X.3 7 non-null values
X.4 7 non-null values
X.5 7 non-null values
X.6 7 non-null values
[...]
X.23 7 non-null values
X.24 7 non-null values
X.25 5 non-null values
X.26 3 non-null values
dtypes: float64(8), int64(10), object(8)

Because of the delimiter problem noted by @MRAB, it has some trouble with the last few columns:

In [73]: data.iloc[:, 20:]
Out[73]:
X.21 X.22 X.23 X.24 X.25 X.26
0 315 0.95 ABC transporter transmembrane region
1 527 0.93 ABC transporter None None
2 408 0.86 RecF/RecN/SMC N terminal domain
3 575 0.85 RecF/RecN/SMC N terminal domain
4 556 0.72 AAA ATPase domain None
5 275 0.85 YceG-like family None None
6 200 0.85 Pyridine nucleotide-disulphide oxidoreductase None

but that can be patched up at the end.
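For instance, here's a rough sketch of that patch-up step, continuing from the data frame read above and assuming the trailing columns (X.23 onward in the output) are just fragments of one free-text description field; the column names come from the output shown, and the new "description" name is made up for this example:

# Columns holding the spilled-over description fragments (names as shown above).
text_cols = ["X.23", "X.24", "X.25", "X.26"]

# Re-join the fragments into a single column, skipping the NaN fillers
# left behind by the shorter rows, then drop the fragment columns.
data["description"] = data[text_cols].apply(
    lambda row: " ".join(str(v) for v in row.dropna()), axis=1
)
data = data.drop(columns=text_cols)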

How to make read_csv more flexible with numbers and whitespaces

I'd try it like this.

I'd manipulate the text file before attempting to parse it into a dataframe, as follows:

import pandas as pd
import re

# Read the whole file and flatten it to one line so it can be split in one go.
with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")

# Append an '@' marker to every number of the form 12,34 so it can be
# used as a delimiter afterwards.
prepared_text = re.sub(r'(\d+,\d+)', r'\1@', g)

df = pd.DataFrame({'My columns': prepared_text.split('@')})
print(df)

This gives the following:

                                          My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2

I guess this'd suffice as long as the input file wasn't too large, but using the re module and substitution gives you the control you seek.

The parentheses in (\d+,\d+) mark a group that we want to match; here it matches each of the numbers in your text file.
In the replacement, \1 is a backreference to that matched group, so every matched number is replaced by itself followed by @ (e.g. 20,00 becomes 20,00@).

Then we use the inserted character as a delimiter.
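As a quick standalone illustration of the group and backreference (the sample string below is made up):

import re

sample = "I am an example 20,00 and some more text 3,50"

# (\d+,\d+) captures each number; \1 writes the captured number back,
# followed by the '@' marker.
print(re.sub(r'(\d+,\d+)', r'\1@', sample))
# I am an example 20,00@ and some more text 3,50@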

There are some good examples here:

https://lzone.de/examples/Python%20re.sub

How to read file with space separated values in pandas

Add the delim_whitespace=True argument; it's faster than using a regex.
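A minimal sketch (the filename is a placeholder; note that newer pandas releases deprecate delim_whitespace in favour of sep=r"\s+"):

import pandas as pd

# "data.txt" stands in for your space-separated file.
df = pd.read_csv("data.txt", delim_whitespace=True)
print(df.head())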

Read Space-separated Data with Pandas

Your original line:

pd.read_csv(filename, sep=' ', header=None)

was specifying the separator as a single space. Because your CSVs can contain spaces or tabs, you can pass a regular expression to the sep parameter like so:

pd.read_csv(filename, sep=r'\s+', header=None)

This defines the separator as one or more whitespace characters; there is a handy cheatsheet that lists regular expressions.
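As a quick self-contained check (the small file written here is made up, mixing tabs and runs of spaces):

import pandas as pd

# Write a tiny file that mixes tabs and multiple spaces between fields.
with open("mixed.txt", "w") as f:
    f.write("a\tb   c\n")
    f.write("1  2\t3\n")

df = pd.read_csv("mixed.txt", sep=r"\s+", header=None)
print(df)  # both rows parse cleanly into three columns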


