Import Text to Pandas with Multiple Delimiters

One way is to use a regular-expression separator, which the Python parsing engine supports. For example:

>>> !cat castle.dat
c stuff
c more header
c begin data
1 1:.5
1 2:6.5
1 3:5.3
>>> df = pd.read_csv('castle.dat', skiprows=3, names=['a', 'b', 'c'],
...                  sep=' |:', engine='python')
>>> df
   a  b    c
0  1  1  0.5
1  1  2  6.5
2  1  3  5.3
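
To try this without creating castle.dat, the same call can be run against an in-memory buffer. This is a minimal sketch: the sample text is copied from the listing above, and the character-class form [ :] is an assumption of mine, shown only as an equivalent spelling of the same "space or colon" rule.

```python
import io
import pandas as pd

# Same contents as castle.dat above, held in memory instead of on disk.
text = """c stuff
c more header
c begin data
1 1:.5
1 2:6.5
1 3:5.3
"""

# A separator longer than one character (other than r'\s+') is treated as a
# regular expression, which requires the Python parsing engine.
df = pd.read_csv(io.StringIO(text), skiprows=3, names=['a', 'b', 'c'],
                 sep=r' |:', engine='python')

# A character class expresses the same "space or colon" rule.
df_alt = pd.read_csv(io.StringIO(text), skiprows=3, names=['a', 'b', 'c'],
                     sep=r'[ :]', engine='python')

print(df)
print(df.equals(df_alt))
```

Reading from io.StringIO is also a convenient way to test a separator on a few lines before pointing read_csv at the real file.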

Import .txt to Pandas Dataframe With Multiple Delimiters

You can start by setting names on your existing columns, and then apply a regex to the data while creating the new columns.

To fix the "single space delimiter" issue in your output, you can define "at least 2 whitespace characters", e.g. [\s]{2,}, as the delimiter. That keeps values containing a single space, such as St. Elf in the City names, in one piece.

An example:

import pandas as pd
import re

# Split only on runs of two or more whitespace characters, so that
# single spaces inside a value (e.g. "St. Elf") are preserved.
df = pd.read_csv(
    'test.txt',
    sep=r'[\s]{2,}',
    engine='python',
    header=None,
    index_col=False,
    names=["FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"],
)

# FULLSID fuses a 9-digit ID, a birth date, and the city into one field.
sid_pattern = re.compile(r'(\d{9})(\d+-\d+-\d+)(.*)', re.IGNORECASE)
df['SID'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(1), axis=1)
df['Birth'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(2), axis=1)
df['City'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(3), axis=1)

# TeacherData fuses the state, a postal code, and the teacher's first name.
teacherdata_pattern = re.compile(r'(.{2})([\dA-Z]+\d)(.*)', re.IGNORECASE)
df['States'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(1), axis=1)
df['Postal'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(2)[-4:], axis=1)
df['TeacherFirstN'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(3), axis=1)

# Drop the fused source columns now that they have been split.
del df['FULLSID']
del df['TeacherData']

print(df)

Output:

  FirstN  LastN TeacherLastN        SID       Birth        City States Postal TeacherFirstN
0    Ann   Gosh         Ryan  123456789  2008-12-15      Irvine     CA   A9Z5         Steve
1   Yosh   Dave         Tuck  987654321  2009-04-18     St. Elf     NY   P8G0          Brad
2  Clair  Simon         John  324567457  2008-12-29  New Jersey     NJ   R9B3           Dan
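
As a possible alternative to the row-by-row apply calls, Series.str.extract can split a fused field into several columns in one pass. The sketch below is not the original poster's data: the two sample lines, the three-column layout, and the group names are assumptions chosen to mirror the FULLSID field above.

```python
import io
import pandas as pd

# Hypothetical sample: columns are separated by two or more spaces, and the
# third column fuses a 9-digit ID, a date, and a city name.
text = (
    "Ann    Gosh   1234567892008-12-15Irvine\n"
    "Yosh   Dave   9876543212009-04-18St. Elf\n"
)

df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python',
                 header=None, names=['FirstN', 'LastN', 'FULLSID'])

# str.extract returns one column per named group, for every row at once.
parts = df['FULLSID'].str.extract(
    r'(?P<SID>\d{9})(?P<Birth>\d{4}-\d{2}-\d{2})(?P<City>.*)'
)

# Replace the fused column with the extracted ones.
df = df.drop(columns='FULLSID').join(parts)
print(df)
```

Because the separator needs at least two whitespace characters, the single space in "St. Elf" stays inside the City value.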

pandas read_csv() for multiple delimiters

From this question, Handling Variable Number of Columns with Pandas - Python, one workaround for pandas.errors.ParserError: Expected 29 fields in line 11, saw 45 is to tell read_csv in advance how many columns to expect.

my_cols = [str(i) for i in range(45)]  # create some column names
df_user_key_word_org = pd.read_csv(filepath + "user_key_word.txt",
                                   sep=r"\s+|;|:",
                                   names=my_cols,
                                   header=None,
                                   engine="python")
# I tested with s = StringIO(text_from_OP) on my computer

Hope this works.
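
A small self-contained sketch of the same idea follows. The three input lines are made up for illustration (the original file is not shown here), and five column names stand in for the 45 used above.

```python
import io
import pandas as pd

# Hypothetical ragged input: the rows have 4, 2, and 5 fields.
text = "u1 apple;pear:fig\nu2 kiwi\nu3 a;b:c;d\n"

# Supply at least as many names as the widest row has fields.
my_cols = [str(i) for i in range(5)]

df = pd.read_csv(io.StringIO(text), sep=r"\s+|;|:", names=my_cols,
                 header=None, engine="python")
print(df)
```

Rows with fewer fields than names are padded with NaN, so every row fits the declared width.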

Convert text file into dataframe with custom multiple delimiter in python

It's tricky to know exactly what your rules for splitting are, but you can use a regex as the delimiter.

Here is a working example to split the lists and date as columns, but you'll probably have to tweak it to your exact rules:

df = pd.read_csv('output.txt', sep=r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)',
                 header=None, engine='python', names=(None, 'a', 'b', 'date')).iloc[:, 1:]

output:

                                      a                     b                    date
0             2 persons, 1 cat, 1 clock    2 persons, 1 chair  Tue, 05 April 03:54:02
1   3 persons, 1 cat, 1 laptop, 1 clock   4 persons, 2 chairs  Tue, 05 April 03:54:05
2                    3 persons, 1 chair   4 persons, 2 chairs  Tue, 05 April 03:54:07
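
When tweaking a delimiter like this, it can help to try it with re.split on a single line before handing it to read_csv. The sketch below does that; the sample line is hypothetical, shaped only to fit the pattern, and the real output.txt may look different.

```python
import re

# The delimiter from the answer above. Its groups are non-capturing (?:...),
# so re.split returns only the text between matches.
pattern = re.compile(r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)')

# Hypothetical line, shaped to fit the pattern; the real file may differ.
line = ("0: 480x640 2 persons, 1 cat, 1 clock, "
        "1: 480x640 2 persons, 1 chair, "
        "Done. (0.5s) Tue, 05 April 03:54:02")

# Strip the whitespace left around each field by the split.
fields = [f.strip() for f in pattern.split(line)]
print(fields)
```

On this sample the first field is empty, because the pattern also matches at the very start of the line. That is presumably why the answer names the first column None and drops it with .iloc[:, 1:].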

How to read txt file in pandas with multiple delimiters?

The \s+ delimiter would work:

import os

df = pd.read_csv(os.path.join(maindir, 'EDMA_1_rcp26_2025_1_output.rsv'),
                 skiprows=9, delimiter=r'\s+', header=None)

Pretty simple, actually.
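
Here is a minimal sketch of the same call on an in-memory sample. The file contents are invented (two header lines instead of nine, then values separated by a mix of spaces and tabs). Note that r'\s+' is the one multi-character separator that does not force engine='python'.

```python
import io
import pandas as pd

# Hypothetical file: two header lines, then columns separated by a mix of
# spaces and tabs.
text = "model output\nunits: m\n1   2.5\t3\n4 5.5   6\n"

# skiprows drops the header lines; r'\s+' matches any run of whitespace.
df = pd.read_csv(io.StringIO(text), skiprows=2, sep=r'\s+', header=None)
print(df)
```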

Convert text file containing multiple delimiters to CSV

Your regex needs a tweak: r"[ \t]+" matches a run of one or more spaces and tabs. Additionally, pandas uses the first line of the file to determine how many columns there are. Your example starts with 4 columns and then adds another later on. That's too late: pandas has already created 4-element rows. You can solve that by supplying your own column names, which tells pandas how many columns there really are. In this example I'm just using integers, but you could give them more useful names.

df = pd.read_csv('Water level.txt', sep=r'[ \t]+', encoding='GBK',
                 engine='python', names=range(5))
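
To see why the "+" matters, and how the explicit names handle a row that is wider than the first one, here is a short sketch. The sample text is hypothetical and much smaller than a real water-level file, and the encoding argument is left out because the sample is plain ASCII.

```python
import io
import re
import pandas as pd

# Without "+", each space is its own delimiter, so a run of two spaces
# yields an empty field in between.
print(re.split(r'[ \t]', '1  2'))    # ['1', '', '2']
print(re.split(r'[ \t]+', '1  2'))   # ['1', '2']

# Hypothetical ragged input: the third row has one more field than the rest.
text = "a  1\t2 3\nb 4  5\t6\nc 7 8  9 10\n"

# names=range(5) declares five columns up front, so the wider row fits.
df = pd.read_csv(io.StringIO(text), sep=r'[ \t]+', engine='python',
                 names=range(5))
print(df)
```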

Importing CSV file with Multiple Delimiters in Python

I tried the file you provided, and it was actually giving me an encoding error.

Try the following encoding:

pd.read_csv('ses_awards.csv', encoding = 'ISO-8859-1')
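
The effect of the encoding argument can be reproduced on a tiny in-memory sample. The bytes below are an assumption standing in for ses_awards.csv, which is not shown here; they contain one character that is valid in ISO-8859-1 but not in UTF-8.

```python
import io
import pandas as pd

# Hypothetical CSV bytes encoded as ISO-8859-1 (Latin-1); the byte 0xE9 ("é")
# is not valid UTF-8 on its own.
raw = "name,award\nJosé,1\n".encode("ISO-8859-1")

try:
    pd.read_csv(io.BytesIO(raw))  # the default encoding is UTF-8
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)

# Naming the actual encoding lets the bytes decode cleanly.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(df)
```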

Parsing txt file with multiple delimiters

A crude "solution" (which assumes the datafile is perfectly formatted):

import numpy as np

with open('matrix.dat', 'r') as data_file:
    rows, cols = [int(c) for c in data_file.readline().split() if c.isnumeric()]
    array = np.fromstring(data_file.read(), sep=' ').reshape(rows, cols)

And here's a probably unnecessary alternative which avoids reading the entire file as a single string:

import itertools
import numpy as np

chainstar = itertools.chain.from_iterable
with open('matrix.dat', 'r') as data_file:
    rows, cols = [int(c)
                  for c in data_file.readline().split()
                  if c.isnumeric()]
    # np.float was removed in NumPy 1.24; the builtin float is equivalent.
    array = np.fromiter(chainstar(map(lambda s: s.split(), data_file)),
                        dtype=float,
                        count=rows*cols).reshape(rows, cols)
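
For experimenting without a matrix.dat on disk, the same header-then-values idea can be sketched against an in-memory buffer. The layout below (a header line naming the dimensions, followed by the values) is an assumption about the file format, and np.array over the split tokens is swapped in here as a simpler stand-in for the fromstring and fromiter calls above.

```python
import io
import numpy as np

# Hypothetical layout: a header line naming the dimensions, then the values.
text = "rows 2 cols 3\n1 2 3\n4 5 6\n"

with io.StringIO(text) as data_file:
    # Keep only the numeric tokens of the header line.
    rows, cols = [int(c) for c in data_file.readline().split() if c.isnumeric()]
    # str.split() with no argument splits on any run of whitespace,
    # including the newlines between rows.
    array = np.array(data_file.read().split(), dtype=float).reshape(rows, cols)

print(array)
```

Like the versions above, this assumes the file is perfectly formatted: reshape raises a ValueError if the number of values does not equal rows * cols.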

