Create Pandas Dataframe from Txt File with Specific Pattern

First, use read_csv with the names parameter to create a DataFrame with a single column, Region Name. The separator must be a value that does NOT occur in the data (such as ;):

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

Then insert a new column State: extract the state name from the rows that contain the text [edit] and forward-fill it. In the column Region Name, remove everything from ( to the end of the string.

df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required in pandas 2.0+, where str.replace is literal by default
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)

Finally, remove the rows that contain the text [edit] by boolean indexing; the mask is created by str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson

If you need to keep the full values, the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)
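
For experimenting without the original file, the same steps can be run on an in-memory sample. This is only a sketch: the sample text and the variable names state and is_header are made up for illustration, and the mask is computed before the cleanup so the order of the last steps does not matter.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in the same layout as the file discussed above.
sample = StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# ";" does not occur in the data, so every line lands in one column.
df = pd.read_csv(sample, sep=";", names=["Region Name"])

# State names appear only on the "[edit]" rows; ffill copies them downwards.
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
is_header = df["Region Name"].str.contains(r"\[edit\]")

df.insert(0, "State", state)
# Drop everything from " (" to the end of the string.
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)
# Remove the header rows themselves.
df = df[~is_header].reset_index(drop=True)
print(df)
```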

Python Text File to Data Frame with Specific Pattern

This can be done without regex with list comprehension and splitting strings:

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

data = []

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
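# (enumerate is used because text.index() returns the first match only,
# which gives wrong positions if two lines are identical)
indices = [n for n, line in enumerate(text) if 'SHARE' in line]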
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1], }
    items = list(s.split() for s in i[3:])
    merged_items = []

    for i in items:
        if len(i[0]) == 1 and i[0].isalpha():
            merged_items.append(i)
        else:
            merged_items[-1][-1] = merged_items[-1][-1] + i[0]
    d.update({name: value for name,value in merged_items})
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)

Output:

	Number	Register	City	Id	Share	Name	Born	f	h	i	c
0	01600	4314	London	1	73/1284	John Smith	1960-01-01	4222/2001	1334/2000	5774/2000	nan
1	01600	4314	London	4	58/1284	Boris Morgan	1965-01-01	4222/2000	nan	nan	4222/1988

Python Text to Data Frame with Specific Pattern

I wouldn't use the same variable i for both the inner and outer loops. Changing your for loop to the following should be cleaner:

for i in items:
    d = {'Number': number, 
         'Register': register, 
         'City': city, 
         'Id': int(i[0].split()[0]), 
         'Share': i[0].split(': ')[1], 
         'Name': i[1], 
         }
    
    if "ADR:" in i[2]:
        born, address = i[2].split("ADR:")
        d['Born'] = born.replace("BORN:", "").strip()
        d['Address'] = address.strip()
    else:
        d['Born'] = i[2].split()[1]
    
    if len(i)>3:
        for j in i[3:]:
            key, value = j.split(" ", 1)
            d[key] = value
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)
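
The per-record logic can also be pulled out into a small function, which makes it easy to try on a single record. The sketch below is not part of either answer: parse_record and its meta argument are hypothetical names, and it assumes the layout of the sample text above (id and share on the first line, name on the second, BORN/ADR on the third, then one-letter keys).

```python
import pandas as pd


def parse_record(lines, meta):
    """Turn one record (a list of stripped lines) into a dict.

    Hypothetical helper; assumes the layout of the sample text above.
    """
    head = lines[0]
    row = dict(meta)  # copy the shared metadata (Number, Register, City)
    row["Id"] = int(head.split()[0])
    row["Share"] = head.split(": ")[1]
    row["Name"] = lines[1]
    # partition returns an empty address when "ADR:" is missing
    born, _, address = lines[2].partition("ADR:")
    row["Born"] = born.replace("BORN:", "").strip()
    if address:
        row["Address"] = address.strip()
    # remaining lines look like "<key> <value>"
    for extra in lines[3:]:
        key, value = extra.split(" ", 1)
        row[key] = value.strip()
    return row


record = [
    "1 SHARE: 73/1284",
    "John Smith",
    "BORN: 1960-01-01 ADR: Streetname 3/2   1000",
    "f 4222/2001",
    "h 1334/2000",
]
meta = {"Number": "01600", "Register": "4314", "City": "London"}

row = parse_record(record, meta)
df_one = pd.DataFrame([row])
print(df_one)
```

Unlike the first answer, this sketch does not merge continuation lines into the previous value; every line after the third is expected to start with its own key.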

How to create a dataframe from .txt file with headers mixed in the rows, without iteration?

IIUC:

import pandas as pd
from io import StringIO
txt = StringIO("""Bob Sales
12
33
45
Sam Sales
23
Wendy Sales
12
33
45
64
54""")

df = pd.read_csv(txt, header=None, sep=r'\s\s+', engine='python')

df[1] = df[0].str.extract('([a-zA-Z ]+)').ffill()
df_out = df[df[0] != df[1]]
print(df_out)

Output:

     0            1
1   12    Bob Sales
2   33    Bob Sales
3   45    Bob Sales
5   23    Sam Sales
7   12  Wendy Sales
8   33  Wendy Sales
9   45  Wendy Sales
10  64  Wendy Sales
11  54  Wendy Sales

Details: a regex extracts the header text into a new column, which is non-null only on the header rows; ffill then copies each header down to the rows below it. Finally, the dataframe is filtered to the rows where the original column is not equal to the new column, which drops the header rows themselves.
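
A variant of the same idea, assuming every non-header line is a plain number: detect the headers with pd.to_numeric instead of a regex. The shortened sample and the column names name and value are made up for this sketch.

```python
import pandas as pd

# Shortened, hypothetical sample in the same layout as above.
lines = pd.Series("Bob Sales\n12\n33\nSam Sales\n23".splitlines())

# Lines that cannot be parsed as numbers become NaN: those are the headers.
values = pd.to_numeric(lines, errors="coerce")
is_header = values.isna()

out = pd.DataFrame({
    # keep the text only on header rows, then copy it down with ffill
    "name": lines.where(is_header).ffill(),
    "value": values,
})
# drop the header rows and restore an integer dtype for the values
out = out[~is_header].astype({"value": int}).reset_index(drop=True)
print(out)
```

This also yields a numeric value column, whereas the regex version leaves the values as strings.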

Load data from txt with pandas

You can do it like this:

import pandas as pd
# raw string, so that "\f" is not interpreted as an escape sequence
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")

(for example, df = pd.read_csv(r'F:\Desktop\ds\text.txt', delimiter="\t"))
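
Windows paths in ordinary string literals are easy to get wrong, because sequences such as \t and \f are escape sequences in Python. The short sketch below illustrates the problem with the example path from above; the variable names are arbitrary.

```python
from pathlib import PureWindowsPath

broken = 'ds\text.txt'                # "\t" is read as a tab character
raw = r'F:\Desktop\ds\text.txt'       # raw string keeps every backslash
forward = 'F:/Desktop/ds/text.txt'    # forward slashes need no escaping

print('\t' in broken)                 # the tab ended up inside the path
print(PureWindowsPath(raw) == PureWindowsPath(forward))
print(PureWindowsPath(raw).name)
```

Either the raw string or the forward-slash form can be passed to pd.read_csv.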

How to create a dataframe from text file having single column

Try this:

import numpy as np
import pandas as pd

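# recent NumPy versions accept only a single-character delimiter; use any
# character that does not occur in the data (a tab is assumed here), so
# that each line is read as one value
x = np.loadtxt('test1.txt', delimiter='\t', dtype=str)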
reshaped = x.reshape(-1,5).T
df = pd.DataFrame(data = reshaped[1:,:], columns = reshaped[0])

print(df)

Alternatively, without NumPy:

def parseFile(filename, vals_per_col):
    with open(filename, 'r') as f:
        lines = [line.strip() for line in f if line.strip()]
    # each block is one header line followed by vals_per_col values
    step = vals_per_col + 1
    return {lines[i]: lines[i+1 : i+step] for i in range(0, len(lines), step)}

df = pd.DataFrame(parseFile('test1.txt', 4))
print(df)

Output:

  SlNo Student Name   Grade      Subject Marks Obtained Percentage
0    1            A   First      English             50        10%
1    2            B  Second  Mathematics             65        20%
2    3            C   Third      Science             55        30%
3    4            D  Fourth      Physics             70        40%
