Convert a Text File into Columns

Converting a text file to CSV with columns

pandas has a method to read a fixed-width text file. There are additional parameters to specify the column widths if the default column inference isn't correct, but in this case the default works:

import pandas as pd

df = pd.read_fwf('input.txt', header=None)
df.to_csv('output.csv', index=False, header=['Number','Address','In_Out'])

output.csv:

Number,Address,In_Out
85350,5211 APPLE LN,O
85805,201 ORANGE ST,I
84412,1313 BANANA RD,I
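
If the inferred boundaries ever come out wrong, you can pass the widths explicitly instead of relying on inference. A minimal sketch, assuming column widths that match the sample rows above (the numbers are an assumption, adjust them to your actual layout):

import pandas as pd

# The widths below (6, 15, 1) are an assumption based on the sample rows;
# change them to match your real fixed-width layout.
df = pd.read_fwf('input.txt', widths=[6, 15, 1], header=None)
df.to_csv('output.csv', index=False, header=['Number', 'Address', 'In_Out'])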

I want to convert the following text file into one having the headers as columns and the corresponding data in rows

I have made some assumptions, as the data you have provided looks like a list of JSON objects: there is a "," between your entries.

Here is my input.json file. Notice I have added [ and ] at the top and bottom, as this gives your data a proper JSON structure.

[
  {
    "campaignId": "all",
    "startDate": "2020-06-11",
    "endDate": "2020-06-11",
    "device": "Computers",
    "network": "Display Network",
    "channel": "all",
    "accLevelQS": -1.0,
    "impressions": 389,
    "clicks": 3,
    "ctr": 0.0,
    "avgCPC": 0.0,
    "convValuePerClick": 0.0,
    "convValuePerCost": 0.0,
    "costConv1PerClick": 0.0,
    "convRate1PerClick": 0.0,
    "cost": 0.142884,
    "conv1PerClick": 0.0,
    "totalConvValue": 0.00,
    "allConversions": 0.0,
    "allConversionValue": 0.00,
    "avgPosition": 0.0,
    "intr": 3,
    "searchImprShare": 0.0,
    "contImprShare": 5.0,
    "impressionShare": 5.0
  },
  {
    "campaignId": "all",
    "startDate": "2020-06-11",
    "endDate": "2020-06-11",
    "device": "Mobile devices with full browsers",
    "network": "Display Network",
    "channel": "all",
    "accLevelQS": -1.0,
    "impressions": 6101,
    "clicks": 90,
    "ctr": 0.0,
    "avgCPC": 0.0,
    "convValuePerClick": 0.0,
    "convValuePerCost": 0.0,
    "costConv1PerClick": 0.0,
    "convRate1PerClick": 0.0,
    "cost": 4.342799,
    "conv1PerClick": 0.0,
    "totalConvValue": 0.00,
    "allConversions": 0.0,
    "allConversionValue": 0.00,
    "avgPosition": 0.0,
    "intr": 90,
    "searchImprShare": 0.0,
    "contImprShare": 5.0077566465021217,
    "impressionShare": 5.0077566465021217
  }
]

Here is the code which uses the pandas library to process the data into a dataframe and then write it to a CSV file.

import json  # Available by default, no install required.
import glob # Available by default, no install required.
import pandas as pd # Requires installation via pip.

# Initialise a list to store our results.
combined_json = []

# Set a glob pattern to *.txt since your files are txt files.
# You can also write the full path e.g. /home/user/textfiles/*.txt
text_files = glob.glob("*.txt")

# Loop through all the text files and combine them into a single JSON list.
# As for 70,000 files, I am unsure how the performance will turn out.
for json_text in text_files:
    with open(json_text, 'r') as text_file:
        combined_json.extend(json.load(text_file))

# Write all the files to a JSON file, for your future usage.
# You can also read directly from the combined_json variable.
with open('input.json', 'w') as json_file:
    json.dump(combined_json, json_file, indent=2)

# Convert the JSON data into a dataframe, using the combined_json variable.
json_df = pd.json_normalize(combined_json)

# Write the data from the dataframe to the CSV file.
# Mode "w" will always overwrite the CSV file, use mode "a" to append text instead of overwriting.
json_df.to_csv("dataframe.csv", mode="w")
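
One caveat: json.load in the loop above assumes every .txt file is already valid JSON on its own. If a file really is just comma-separated objects without the enclosing [ and ], you could wrap it before parsing. A small sketch (the trailing-comma strip is an assumption about how the files end):

import json

def load_loose_json(path):
    # Wrap comma-separated objects in [ ] so json.loads can parse them.
    # Assumes at most a trailing comma after the last object.
    with open(path, 'r') as f:
        raw = f.read().strip().rstrip(',')
    return json.loads('[' + raw + ']')

You could then call load_loose_json(json_text) in place of json.load(text_file) inside the loop.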

For more information on how pd.json_normalize works, you can refer to the pandas documentation for pandas.json_normalize.

To get started with the pandas library, you can refer to the official pandas getting-started guide.

If you wish to load the JSON from a string instead of a file, you can use json.loads from the standard json module.

To learn more about glob, you can refer to the Python glob module documentation.
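
As a quick illustration of what pd.json_normalize does (made-up records, not your data), nested keys are flattened into dotted column names:

import pandas as pd

records = [
    {"id": 1, "stats": {"clicks": 3, "cost": 0.14}},
    {"id": 2, "stats": {"clicks": 90, "cost": 4.34}},
]

# Nested dicts become columns like "stats.clicks" and "stats.cost".
print(pd.json_normalize(records))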

Convert text file data into a dataframe

multiple text files, each file having data like this: Urdu, English-in-brackets

So start with a function that reads a single file of that type:

def read_single_file(filename: str) -> tuple[str, str]:
    urdu = ""
    english = ""
    with open(filename) as f:
        for line in f:
            line = line.strip()  # remove newlines etc.
            if not line:  # ignore empty lines
                continue
            if line.startswith("["):
                english = line.strip("[]")
            else:
                urdu = line
    return (urdu, english)

Then, loop over your files; I'll assume they're just *.txt:

import glob

results = [read_single_file(filename) for filename in glob.glob("*.txt")]

Now that you have a list of 2-tuples, you can just create a dataframe out of it:

import pandas as pd

df = pd.DataFrame(results, columns=["urdu", "english"])
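
If you then want to persist the result, it may help to be explicit about the encoding so the Urdu text survives the round trip. A short sketch (the filename is just an example):

# utf-8 keeps the Urdu text intact; utf-8-sig additionally helps Excel detect the encoding.
df.to_csv("urdu_english.csv", index=False, encoding="utf-8")

Depending on your platform, you may also want to pass encoding="utf-8" to open() inside read_single_file.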

Save columns of data from a text file as separate files

all output files open at the same time

One GNU awk idea:

awk '{for (i=1;i<=NF;i++) print $i > i".out"}' file

NOTES:

  • this will open, and keep open, a file descriptor for each output file
  • many awk implementations have a limit on the number of files they can have open at a time; opening and closing files is time consuming, so from a performance perspective you will want to limit the number of open/close operations
  • GNU awk has a fairly high limit on how many files can be open at one time
  • if you have GNU awk and you receive an error about too many open file descriptors, let us know and we can look at another idea (e.g., running a separate awk for each set of N columns, or an in-memory solution, assuming the whole file fits in memory)
  • you mention columns are separated by "tab spaces" (it isn't clear whether that means tabs and spaces mixed, or tabs or spaces); this answer uses awk's default field delimiter of whitespace (multiple spaces/tabs treated as a single delimiter); if your fields are delimited by tabs but contain spaces within the fields, then change awk '{for ... to awk -F'\t' '{for ...
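
If awk isn't a hard requirement, the same "keep every output file open" idea can be sketched in Python (assumptions: whitespace-delimited columns and output names 1.out, 2.out, ...). It hits the same OS limit on open file descriptors as the awk version, so the same caveats apply:

import sys

def split_columns(path):
    handles = {}  # column number -> open file handle
    try:
        with open(path) as f:
            for line in f:
                for i, field in enumerate(line.split(), start=1):
                    if i not in handles:
                        handles[i] = open(f"{i}.out", "w")
                    handles[i].write(field + "\n")
    finally:
        for h in handles.values():
            h.close()

if __name__ == "__main__":
    split_columns(sys.argv[1])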

in-memory; one output file open at a time; vanilla awk

Assuming the input file can fit into memory:

One idea that should work for all awk flavors:

awk '
{ for (i=1;i<=NF;i++)
      cols[i] = cols[i] (FNR==1 ? "" : ORS) $i
}
END { for (i=1;i<=NF;i++) {
          outfile = i ".out"
          print cols[i] > outfile
          close(outfile)
      }
    }
' file

in-memory; one output file open at a time; GNU awk

Another in-memory solution using GNU awk (for multi-dimensional array support):

awk '
{ for (i=1;i<=NF;i++)
      cols[i][FNR] = $i
}
END { for (i=1;i<=NF;i++) {
          outfile = i ".out"
          for (j=1;j<=FNR;j++)
              print cols[i][j] > outfile
          close(outfile)
      }
    }
' file

All 3 of these answers generate:

$ head ?.out
==> 1.out <==
abcd
efgh
ijkl
mnop
qrst

==> 2.out <==
abcd
efgh
ijkl
mnop
qrst

==> 3.out <==
abcd
efgh
ijkl
mnop
qrst
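
The same in-memory idea can also be sketched with pandas, if Python is an option and the whole file fits in memory (assumes whitespace-delimited columns, mirroring awk's default field splitting):

import pandas as pd

# Read the whole file into memory, then write each column to its own file.
df = pd.read_csv("file", sep=r"\s+", header=None)
for i, col in enumerate(df.columns, start=1):
    df[col].to_csv(f"{i}.out", index=False, header=False)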

Performance review

Setup:

# create a file with 5000 lines and 500 columns; ~19.5 MBytes

awk '
BEGIN { for (i=1;i<=5000;i++) {
            printf "%s", "col_1"
            for (j=2;j<=500;j++)
                printf "\t%s", "col_" j
            print ""
        }
      }
' > multi_column.txt

2.5 million open/close operations

Running either of the two answers that open/close each of the 500 output files for each of the 5000 input lines (i.e., 5000 x 500 = 2.5 million open/close operations):

  • killed after 2 minutes and 800 lines processed
  • extrapolating: ~12.5 minutes to process 5000 lines
  • time will (obviously) vary depending on hardware (eg, Ed Morton reports his answer takes 10 minutes on his laptop)

all (500) output files open at the same time

Running the 1st answer (above):

  • 10 seconds to generate 500 files with 5000 lines each
  • even if we had to limit ourselves to, say, processing 20 columns at a time ... we could make 25 passes of the input file and still complete in < 7 minutes (time could be further reduced by running some of the awk sessions in parallel)

in-memory; one output file open at a time; vanilla awk

Running the 2nd answer (above):

  • 6 seconds to generate 500 files with 5000 lines each

in-memory; one output file open at a time; GNU awk

Running the 3rd answer (above):

  • 3 seconds to generate 500 files with 5000 lines each
  • the previous in-memory answer is slower due to the time required to 'find and append' the new field onto the end of the ever-growing array entry (cols[i]=cols[i] (FNR==1 ? "" : ORS) $i)

