How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
From the documentation, you can use either a regex or delim_whitespace:
>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
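For anyone who wants to try this without creating a file, here is a minimal sketch that feeds the same two lines through io.StringIO. The in-memory buffer stands in for whitespace.csv and is only an assumption for illustration; the sep=r"\s+" call is the same regex approach shown above.

```python
import io

import pandas as pd

# Same content as whitespace.csv above: tabs and spaces mixed as separators.
raw = "a\t b\tc 1 2\nd\t e\tf 3 4\n"

# r"\s+" treats any run of spaces and/or tabs as a single delimiter.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
print(df)
```

Every run of whitespace collapses into one separator, so both rows parse into five columns.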
Can pandas handle variable-length whitespace as column delimiters
I think there's just a missing \ in the docs (maybe because it was interpreted as an escape marker at some point?). It's a regexp, after all:
In [68]: data = read_table('sample.txt', skiprows=3, header=None, sep=r"\s*")
In [69]: data
Out[69]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns:
X.1 7 non-null values
X.2 7 non-null values
X.3 7 non-null values
X.4 7 non-null values
X.5 7 non-null values
X.6 7 non-null values
[...]
X.23 7 non-null values
X.24 7 non-null values
X.25 5 non-null values
X.26 3 non-null values
dtypes: float64(8), int64(10), object(8)
Because of the delimiter problem noted by @MRAB, it has some trouble with the last few columns:
In [73]: data.ix[:,20:]
Out[73]:
   X.21  X.22  X.23           X.24                   X.25            X.26
0  315   0.95  ABC            transporter            transmembrane   region
1  527   0.93  ABC            transporter            None            None
2  408   0.86  RecF/RecN/SMC  N                      terminal        domain
3  575   0.85  RecF/RecN/SMC  N                      terminal        domain
4  556   0.72  AAA            ATPase                 domain          None
5  275   0.85  YceG-like      family                 None            None
6  200   0.85  Pyridine       nucleotide-disulphide  oxidoreductase  None
but that can be patched up at the end.
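One possible way to do that patching, as a sketch only: rejoin the split-up text columns into a single column and skip the None padding. The small frame below is a hypothetical stand-in for the last columns of data, and the column name description is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the trailing columns shown above.
tail = pd.DataFrame({
    "X.21": [315, 527],
    "X.22": [0.95, 0.93],
    "X.23": ["ABC", "ABC"],
    "X.24": ["transporter", "transporter"],
    "X.25": ["transmembrane", None],
    "X.26": ["region", None],
})

text_cols = ["X.23", "X.24", "X.25", "X.26"]

# Join the word fragments of each row back together, dropping missing cells.
tail["description"] = tail[text_cols].apply(
    lambda row: " ".join(row.dropna()), axis=1
)
tail = tail.drop(columns=text_cols)
print(tail)
```

This assumes the free-text field is always the last one in the row, so everything from X.23 onward belongs to it.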
How to make read_csv more flexible with numbers and whitespaces
I'd try it like this.
I'd manipulate the text file before attempting to parse it into a dataframe, as follows:
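import pandas as pd
import re

# Read the whole file and join its lines with spaces
with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")

# Append "@" after every number such as 20,00 so it can serve as a delimiter
prepared_text = re.sub(r'(\d+,\d+)', r'\1@', g)
df = pd.DataFrame({'My columns': prepared_text.split('@')})
print(df)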
This gives the following:
My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2
I guess this'd suffice as long as the input file wasn't too large, but using the re module and substitution gives you the control you seek.
The parentheses in (\d+,\d+) mark a group which we want to match. We're basically matching any of the numbers in your text file.
Then we use \1, a backreference to the matched group, when specifying the replacement. So whatever \d+,\d+ matched is replaced by that same text followed by @.
Then we use the inserted character as a delimiter.
There are some good examples here:
https://lzone.de/examples/Python%20re.sub
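To see the backreference step in isolation, here is a small sketch on a made-up string (the sample text is an assumption, not the asker's actual file):

```python
import re

# Hypothetical sample text containing two comma-decimal numbers.
text = "I have Problems 10,50 So How can I read it 20,00"

# \1 re-inserts whatever the group (\d+,\d+) matched, then "@" is appended.
marked = re.sub(r"(\d+,\d+)", r"\1@", text)

# Split on the inserted marker and drop empty or whitespace-only pieces.
parts = [p.strip() for p in marked.split("@") if p.strip()]
print(parts)
```

Filtering out the empty pieces avoids the blank trailing row seen in the dataframe output above.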
How to read file with space separated values in pandas
Add the delim_whitespace=True argument; it's faster than a regex. (Note that delim_whitespace is deprecated as of pandas 2.2 in favor of sep=r"\s+".)
Read Space-separated Data with Pandas
Your original line:
pd.read_csv(filename, sep=' ',header=None)
was specifying the separator as a single space. Because your CSVs can contain spaces or tabs, you can pass a regular expression to the sep param like so:
pd.read_csv(filename, sep=r'\s+', header=None)
This defines the separator as one or more whitespace characters. There is a handy cheatsheet that lists regular expressions.
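A quick sketch of the difference, using made-up data with a double space and a tab (the sample string is an assumption for illustration):

```python
import io

import pandas as pd

# Hypothetical data: two spaces between the first fields, a tab before the last.
raw = "x  1\ty\nz  2\tw\n"

# A single-space separator sees an empty field between the two spaces
# and leaves the tab inside the last field.
single = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# r"\s+" collapses any run of spaces or tabs into one separator.
flexible = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(single)
print(flexible)
```

With sep=" " the middle column is entirely empty and the last column still holds the tab, while sep=r"\s+" yields three clean columns.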