Create Pandas Dataframe from Txt File with Specific Pattern

You can first use read_csv with the names parameter to create a DataFrame with a single column Region Name. The separator must be a character that does NOT occur in the data (like ;), so that each line is read as one value:

import pandas as pd

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

Then insert a new column State: extract the text before [edit] from the rows that contain it and forward-fill it onto the rows below. After that, strip everything from ( to the end of the string in column Region Name.

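df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required in recent pandas versions, where str.replace is literal by default
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)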

Finally, remove the rows that contain [edit] by boolean indexing; the mask is created by str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
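
To try the steps above without a file on disk, here is a minimal sketch that runs them on an in-memory sample. The sample text is an assumption modeled on the output shown; it is not the original filename.txt.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the file layout: state headers end in "[edit]".
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""

# ";" never occurs in the data, so every line becomes a single value.
df = pd.read_csv(io.StringIO(sample), sep=";", names=["Region Name"])

# Take the state from the "[edit]" rows and forward-fill it onto the rows below.
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
df.insert(0, "State", state)

# Drop everything from " (" to the end of the string.
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)

# Remove the header rows themselves.
df = df[~df["Region Name"].str.contains(r"\[edit\]")].reset_index(drop=True)
print(df)
```

The order matters: State has to be extracted before the [edit] rows are filtered out, otherwise there is nothing left to forward-fill from.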

If you need the full values (including the text in parentheses), the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)

Python Text File to Data Frame with Specific Pattern

This can be done without regex, using list comprehensions and string splitting:

import pandas as pd

text = '''Number 01600 London Register 4314

Some random text...

1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''

text = [i.strip() for i in text.splitlines()]  # create a list of stripped lines

data = []

# extract metadata from the first line
header = text[0].split()
number, city, register = header[1], header[2], header[4]

# indices of the lines where new items start
# (enumerate avoids text.index(), which returns the first match for duplicate lines)
indices = [n for n, line in enumerate(text) if 'SHARE' in line]
# split the list at those indices to get one list of lines per item
items = [text[i:j] for i, j in zip([0] + indices, indices + [None])][1:]

for item in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
         'Name': item[1], 'Born': item[2].split()[1]}
    merged_items = []

    for parts in (s.split() for s in item[3:]):
        if len(parts[0]) == 1 and parts[0].isalpha():
            # a line such as "f 4222/2001" starts a new key/value pair
            merged_items.append(parts)
        else:
            # continuation line: append it to the previous value
            merged_items[-1][-1] = merged_items[-1][-1] + parts[0]
    d.update({name: value for name, value in merged_items})
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)

Output:

   Number  Register    City  Id    Share          Name        Born          f          h          i          c
0   01600      4314  London   1  73/1284    John Smith  1960-01-01  4222/2001  1334/2000  5774/2000        nan
1   01600      4314  London   4  58/1284  Boris Morgan  1965-01-01  4222/2000        nan        nan  4222/1988
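
The step that carries most of the logic here is cutting the list of lines into one chunk per record. As a rough sketch, that step can be isolated in a small helper; split_at_markers is a hypothetical name introduced for illustration, not part of the answer above.

```python
def split_at_markers(lines, marker):
    """Split lines into chunks, each starting at a line that contains marker.

    Lines before the first marker line are dropped.
    """
    starts = [n for n, line in enumerate(lines) if marker in line]
    # Pair each start with the next start; the last chunk runs to the end.
    return [lines[i:j] for i, j in zip(starts, starts[1:] + [None])]


lines = [
    "Number 01600 London Register 4314",
    "1 SHARE: 73/1284",
    "John Smith",
    "4 SHARE: 58/1284",
    "Boris Morgan",
]
chunks = split_at_markers(lines, "SHARE")
print(chunks)
```

Using enumerate rather than list.index keeps the positions correct even when two marker lines are textually identical.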

Python Text to Data Frame with Specific Pattern

I wouldn't use the same variable i for both the inner and outer loops. Changing your for loop to the following should be cleaner:

for i in items:
    d = {'Number': number,
         'Register': register,
         'City': city,
         'Id': int(i[0].split()[0]),
         'Share': i[0].split(': ')[1],
         'Name': i[1],
         }

    if "ADDR" in i[2]:
        # the line holds both the birth date and an address
        born, address = i[2].split("ADDR:")
        d['Born'] = born.replace("BORN:", "").strip()
        d['Address'] = address.strip()
    else:
        d['Born'] = i[2].split()[1]

    if len(i) > 3:
        # remaining lines are "key value" pairs
        for j in i[3:]:
            key, value = j.split(" ", 1)
            d[key] = value
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)
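
The BORN/address split can also be sketched with str.partition, which does not raise when the marker is missing. The helper name split_born_address and the marker parameter are assumptions made for illustration. The default "ADR:" matches the sample text shown earlier on this page, while the loop above checks for "ADDR", so adjust it to whatever your file contains.

```python
def split_born_address(line, marker="ADR:"):
    """Return (born, address) from a line like 'BORN: 1960-01-01 ADR: Street 1'.

    address is None when the marker does not occur in the line.
    """
    born, sep, address = line.partition(marker)
    born = born.replace("BORN:", "").strip()
    return born, (address.strip() if sep else None)


with_address = split_born_address("BORN: 1960-01-01 ADR: Streetname 3/2 1000")
without_address = split_born_address("BORN: 1965-01-01")
print(with_address)
print(without_address)
```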

How to create a dataframe from .txt file with headers mixed in the rows, without iteration?

IIUC:

import pandas as pd
from io import StringIO
txt = StringIO("""Bob Sales
12
33
45
Sam Sales
23
Wendy Sales
12
33
45
64
54""")

df = pd.read_csv(txt, header=None, sep=r'\s\s+', engine='python')

# expand=False returns a Series, which can be assigned directly as a column
df[1] = df[0].str.extract('([a-zA-Z ]+)', expand=False).ffill()
df_out = df[df[0] != df[1]]
print(df_out)

Output:

     0            1
1   12    Bob Sales
2   33    Bob Sales
3   45    Bob Sales
5   23    Sam Sales
7   12  Wendy Sales
8   33  Wendy Sales
9   45  Wendy Sales
10  64  Wendy Sales
11  54  Wendy Sales

Details: use a regex to find the header rows and create a new column that holds a value only on those rows, then use ffill to propagate each header down to the rows below it. Finally, filter the dataframe to the rows where the original column is not equal to the new column, which drops the header rows themselves.
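
A variation on the same idea, as a sketch: instead of matching letters with a regex, treat every row that cannot be parsed as a number as a header. This swaps in pd.to_numeric for str.extract and assumes the data rows are always numeric. The column names value and group are made up for the example.

```python
import io
import pandas as pd

txt = io.StringIO("Bob Sales\n12\n33\nSam Sales\n23\n")
df = pd.read_csv(txt, header=None, names=["value"], sep=r"\s\s+", engine="python")

# Rows that fail numeric parsing become NaN; those are taken to be headers.
numeric = pd.to_numeric(df["value"], errors="coerce")
is_header = numeric.isna()

# Keep the text only on header rows, then propagate it downwards.
df["group"] = df["value"].where(is_header).ffill()

# Drop the header rows and store the values as integers.
out = df.loc[~is_header].assign(value=numeric[~is_header].astype(int))
print(out)
```

The original index is kept, as in the answer above, so the gaps in the index show where the header rows were.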

Load data from txt with pandas

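You can do it like this:

import pandas as pd
# raw string, so the backslashes in the Windows path are not read as escapes (\f is a form feed)
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")

(for example, df = pd.read_csv(r'F:\Desktop\ds\text.txt', delimiter="\t"))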
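
To see why the path is best written as a raw string, and to check the tab-delimited read end to end, here is a small sketch. The file content and the temporary location are assumptions made so that the example is self-contained.

```python
import os
import tempfile
import pandas as pd

# In a normal string literal "\t" is a tab character, so the two differ.
plain = "ds\text.txt"
raw = r"ds\text.txt"
print(repr(plain), repr(raw))

# Round-trip a small tab-separated file through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename.txt")
    with open(path, "w") as f:
        f.write("name\tvalue\nA\t1\nB\t2\n")
    df = pd.read_csv(path, delimiter="\t")

print(df)
```

Building the path with os.path.join (or pathlib) sidesteps the escaping question entirely.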

How to create a dataframe from text file having single column

Try this:

import numpy as np
import pandas as pd

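# read the non-empty lines directly; this avoids passing a newline delimiter
# to np.loadtxt, which newer NumPy versions may reject
with open('test1.txt') as f:
    x = np.array([line.strip() for line in f if line.strip()])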
reshaped = x.reshape(-1,5).T
df = pd.DataFrame(data = reshaped[1:,:], columns = reshaped[0])

print(df)

OR

def parseFile(filename, vals_per_col):
    with open(filename, 'r') as f:
        lines = [line.strip() for line in f if line.strip()]
    # each header line is followed by vals_per_col value lines
    return {lines[i]: lines[i+1 : i+1+vals_per_col] for i in range(0, len(lines), vals_per_col+1)}

df = pd.DataFrame(parseFile('test1.txt', 4))
print(df)

Output:

   SlNo  Student Name   Grade      Subject  Marks Obtained  Percentage
0     1             A   First      English              50         10%
1     2             B  Second  Mathematics              65         20%
2     3             C   Third      Science              55         30%
3     4             D  Fourth      Physics              70         40%
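
The reshape trick can be checked on a tiny in-memory list before pointing it at a real file. This is a sketch under the assumption that every header is followed by the same number of values; the list below is made-up data in the same layout.

```python
import numpy as np
import pandas as pd

# Each header is followed by its values, one item per line in the file.
lines = ["SlNo", "1", "2", "Student Name", "A", "B", "Grade", "First", "Second"]
vals_per_col = 2  # number of values after each header (assumed constant)

# One row per column: [header, value, value, ...]
blocks = np.array(lines).reshape(-1, vals_per_col + 1)

# First entry of each block is the column name, the rest is its data.
df = pd.DataFrame(blocks[:, 1:].T, columns=blocks[:, 0])
print(df)
```

If the number of lines is not a multiple of vals_per_col + 1, reshape raises a ValueError, which is a useful early signal that the file does not have the assumed layout.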

