Create Pandas DataFrame from txt file with specific pattern
First use read_csv with the names parameter to create a DataFrame with a single column, Region Name. As the separator, pick a character that does NOT occur in the data (like ;), so that each line is read as one value:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
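Then insert a new column State: extract the state name from the rows that contain the text [edit] and forward-fill it down to the rows that follow. After that, remove everything from ( to the end of the string in column Region Name.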
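df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)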
Finally, remove the rows that contain the text [edit] by boolean indexing; the mask is created by str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
If you need to keep the full values, the solution is simpler:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
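The steps above can be tried without a file on disk. The following is a minimal sketch that feeds a few lines through io.StringIO; the sample lines are an assumption, reconstructed from the output shown above rather than taken from the original file.

```python
from io import StringIO

import pandas as pd

# Assumed sample input, modelled on the rows visible in the output above.
sample = StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# ';' does not occur in the data, so every line becomes a single value.
df = pd.read_csv(sample, sep=";", names=["Region Name"])

# Rows that carry a state name are marked with the literal text "[edit]".
is_state = df["Region Name"].str.contains(r"\[edit\]")

# Extract the state name and forward-fill it onto the region rows below it.
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
df.insert(0, "State", state)

# Drop the state rows, then strip everything from " (" to the end.
df = df[~is_state].reset_index(drop=True)
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)
print(df)
```

Computing the mask before any replacement keeps the filtering step independent of how the region names are cleaned afterwards.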
Python Text File to Data Frame with Specific Pattern
This can be done without regex with list comprehension and splitting strings:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()]  # create a list of lines
data = []
# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]
# create a list of the index numbers of the lines where new items start
# (enumerate avoids wrong positions when two lines are identical)
indices = [n for n, line in enumerate(text) if 'SHARE' in line]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]
for item in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
         'Name': item[1], 'Born': item[2].split()[1]}
    # the remaining lines hold "<letter> <value>" pairs
    pairs = [s.split() for s in item[3:]]
    merged_pairs = []
    for pair in pairs:
        if len(pair[0]) == 1 and pair[0].isalpha():
            merged_pairs.append(pair)
        else:
            # continuation line: append it to the previous value
            merged_pairs[-1][-1] = merged_pairs[-1][-1] + pair[0]
    d.update({name: value for name, value in merged_pairs})
    data.append(d)
# load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
| | Number | Register | City | Id | Share | Name | Born | f | h | i | c |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01600 | 4314 | London | 1 | 73/1284 | John Smith | 1960-01-01 | 4222/2001 | 1334/2000 | 5774/2000 | nan |
| 1 | 01600 | 4314 | London | 4 | 58/1284 | Boris Morgan | 1965-01-01 | 4222/2000 | nan | nan | 4222/1988 |
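The step that splits the list of lines into one block per item can be isolated into a small helper. The following is a minimal sketch; split_records and its marker parameter are hypothetical names introduced here for illustration, and the sample lines are a shortened version of the text above.

```python
def split_records(lines, marker="SHARE"):
    """Split a list of lines into blocks, each starting at a line containing marker."""
    # Positions of the lines that open a new record.
    starts = [n for n, line in enumerate(lines) if marker in line]
    # Pair each start with the next one; None means "until the end of the list".
    return [lines[a:b] for a, b in zip(starts, starts[1:] + [None])]


lines = [
    "Number 01600 London Register 4314",
    "Some random text...",
    "1 SHARE: 73/1284",
    "John Smith",
    "4 SHARE: 58/1284",
    "Boris Morgan",
]
records = split_records(lines)
print(records)
```

Lines before the first marker (the metadata header and any free text) are dropped, which matches the effect of the [1:] slice in the answer above.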
Python Text to Data Frame with Specific Pattern
I wouldn't use the same variable i for both the inner and the outer loop. Changing your for loop to the following should be cleaner:
for i in items:
    d = {'Number': number,
         'Register': register,
         'City': city,
         'Id': int(i[0].split()[0]),
         'Share': i[0].split(': ')[1],
         'Name': i[1],
         }
    # adjust the marker if your file spells it differently (for example "ADR:")
    if "ADDR" in i[2]:
        born, address = i[2].split("ADDR:")
        d['Born'] = born.replace("BORN:", "").strip()
        d['Address'] = address.strip()
    else:
        # assignment, not an annotation: "d['Born']: ..." would store nothing
        d['Born'] = i[2].split()[1]
    if len(i) > 3:
        for j in i[3:]:
            key, value = j.split(" ", 1)
            d[key] = value
    data.append(d)
# load the list of dicts as a dataframe
df = pd.DataFrame(data)
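The handling of the BORN line can also be written with str.partition, which avoids the separate membership test. This is only a sketch: parse_born_line is a hypothetical helper, and the default marker "ADR:" is an assumption based on the sample text earlier on this page.

```python
def parse_born_line(line, marker="ADR:"):
    """Return (born, address); address is None when the marker is absent."""
    # partition splits at the first occurrence of marker;
    # found is an empty string when the marker does not occur.
    born_part, found, address_part = line.partition(marker)
    born = born_part.replace("BORN:", "").strip()
    address = address_part.strip() if found else None
    return born, address


print(parse_born_line("BORN: 1960-01-01 ADR: Streetname 3/2 1000"))
print(parse_born_line("BORN: 1965-01-01"))
```

Inside the loop, the two return values could then be stored as d['Born'] and, when not None, d['Address'].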
How to create a dataframe from .txt file with headers mixed in the rows, without iteration?
IIUC:
from io import StringIO
import pandas as pd
txt = StringIO("""Bob Sales
12
33
45
Sam Sales
23
Wendy Sales
12
33
45
64
54""")
df = pd.read_csv(txt, header=None, sep=r'\s\s+', engine='python')
df[1] = df[0].str.extract(r'([a-zA-Z ]+)', expand=False).ffill()
df_out = df[df[0] != df[1]]
print(df_out)
Output:
0 1
1 12 Bob Sales
2 33 Bob Sales
3 45 Bob Sales
5 23 Sam Sales
7 12 Wendy Sales
8 33 Wendy Sales
9 45 Wendy Sales
10 64 Wendy Sales
11 54 Wendy Sales
Details: use a regex to find the header rows and create a new column that holds only the values matching that pattern, then use ffill to propagate each header down to the rows below it. Finally, keep only the rows where the original column is not equal to the new column.
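The same idea can be sketched on a plain Series, with named columns and numeric values in the result. This is a variant under stated assumptions, not the answer's exact method: it treats every line that is not purely digits as a header, and the column names name and value are made up for illustration.

```python
import pandas as pd

lines = pd.Series(["Bob Sales", "12", "33", "Sam Sales", "23"])

# Assumption: data rows consist of digits only; everything else is a header.
is_header = ~lines.str.fullmatch(r"\d+")

tidy = pd.DataFrame({
    # Keep header text on header rows, NaN elsewhere, then fill downwards.
    "name": lines.where(is_header).ffill(),
    "value": lines,
})

# Drop the header rows and convert the remaining values to integers.
tidy = tidy[~is_header].astype({"value": int}).reset_index(drop=True)
print(tidy)
```

Building the mask from the data rows rather than the header text avoids depending on which characters a header may contain.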
Load data from txt with pandas
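You can do it like this:
import pandas as pd
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")
For example:
df = pd.read_csv(r'F:\Desktop\ds\text.txt', delimiter="\t")
The r prefix makes these raw strings, so the backslashes in a Windows path are not read as escape sequences such as \f or \t.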
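An alternative that sidesteps backslashes in paths altogether is pathlib, since read_csv accepts path objects. The following is a minimal sketch: it writes a small tab-separated file to a temporary directory first so that it can run anywhere, and the file name and contents are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build the path from parts instead of typing separators by hand.
    path = Path(tmp) / "text.txt"
    path.write_text("name\tvalue\nA\t1\nB\t2\n")

    # The first line is used as the header; fields are split on tabs.
    df = pd.read_csv(path, delimiter="\t")

print(df)
```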
How to create a dataframe from text file having single column
Try this:
import numpy as np
import pandas as pd
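# read the non-empty lines; recent NumPy versions reject a multi-character delimiter in np.loadtxt
with open('test1.txt') as f:
    x = np.array([line.strip() for line in f if line.strip()])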
reshaped = x.reshape(-1,5).T
df = pd.DataFrame(data = reshaped[1:,:], columns = reshaped[0])
print(df)
OR
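def parseFile(filename, vals_per_col):
    # use the filename argument instead of a hard-coded path
    with open(filename, 'r') as f:
        lines = [line.strip() for line in f if line.strip()]
    # each block is one header line followed by vals_per_col value lines
    return {lines[i]: lines[i+1 : i+1+vals_per_col] for i in range(0, len(lines), vals_per_col+1)}
df = pd.DataFrame(parseFile('test1.txt', 4))
print(df)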
Output:
SlNo Student Name Grade Subject Marks Obtained Percentage
0 1 A First English 50 10%
1 2 B Second Mathematics 65 20%
2 3 C Third Science 55 30%
3 4 D Fourth Physics 70 40%
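The reshape approach can be checked on an in-memory list before pointing it at a file. This is a minimal sketch with made-up data: two columns with two values each, where every header line is followed directly by its values.

```python
import numpy as np
import pandas as pd

# Assumed layout: header, then vals_per_col values, repeated for each column.
lines = ["SlNo", "1", "2", "Grade", "First", "Second"]
vals_per_col = 2

# One row per column of the final table: [header, value, value, ...].
block = np.array(lines).reshape(-1, vals_per_col + 1)

# The first entry of each row is the column name; the rest, transposed, is the data.
df = pd.DataFrame(block[:, 1:].T, columns=block[:, 0])
print(df)
```

reshape raises a ValueError when the number of lines is not a multiple of vals_per_col + 1, which is a useful early signal that the file does not follow the assumed layout.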