How to merge 200 CSV files in Python
As ghostdog74 said, but this time with headers:
with open("out.csv", "ab") as fout:
    # first file, copied whole (header included):
    with open("sh1.csv", "rb") as f:
        fout.writelines(f)
    # now the rest:
    for num in range(2, 201):
        with open("sh" + str(num) + ".csv", "rb") as f:
            next(f)  # skip the header, portably
            fout.writelines(f)
Python script to merge more than 200 very large CSV files into just one
You probably just need to keep a merged.csv file open whilst reading in each of the certificates.csv files. glob.glob() can be used to recursively find all suitable files:
import glob
import csv
import os

path = r'C:\path\to\folder\where\all\files\are-allowated-in-subfolders'
os.chdir(path)

with open('merged.csv', 'w', newline='') as f_merged:
    csv_merged = csv.writer(f_merged)
    # '**' is needed for recursive=True to actually descend into subfolders
    for filename in glob.glob(os.path.join(path, '**', 'certificates.csv'), recursive=True):
        print(filename)
        try:
            with open(filename, newline='') as f_csv:
                csv_merged.writerows(csv.reader(f_csv))
        except Exception:
            print('problem with file:', filename)
An r prefix can be added to your path to avoid needing to escape each backslash. Also, newline='' should be added to the open() call when using a csv.writer() to stop extra blank lines being written to your output file.
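A minimal sketch of both points (the paths and data here are hypothetical; a temporary directory is used so it runs anywhere):

```python
import csv
import os
import tempfile

path = r'C:\some\dir'  # raw string: backslashes need no escaping

# newline='' stops csv.writer producing extra blank lines on Windows
out = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(out, 'w', newline='') as f:
    csv.writer(f).writerows([['a', 'b'], ['1', '2']])

with open(out, newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # [['a', 'b'], ['1', '2']]
```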
Loop to merge multiple csv files
You were almost there. First of all, you are not actually renaming anything, because you missed file = in front of the rename() call. Then, to add a column to a dataframe, you simply do df[col] = file[col]. Therefore:
import pandas as pd

df = pd.DataFrame()
for i, f in enumerate(files):
    file = pd.read_csv(f)
    file = file.rename(columns={'Damage': '{}sec'.format(i)})
    df['{}sec'.format(i)] = file['{}sec'.format(i)]
Don't forget to add the id column once before iterating.
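A self-contained sketch of that, with the id column added once up front (the 'ID' column name and the in-memory files are hypothetical stand-ins for your real CSVs):

```python
import io
import pandas as pd

# Hypothetical stand-ins for the files list; each has an ID and a Damage column
files = [io.StringIO("ID,Damage\n1,10\n2,20"),
         io.StringIO("ID,Damage\n1,30\n2,40")]

frames = [pd.read_csv(f) for f in files]

df = pd.DataFrame()
df['ID'] = frames[0]['ID']  # the id column, added once before iterating
for i, file in enumerate(frames):
    file = file.rename(columns={'Damage': '{}sec'.format(i)})
    df['{}sec'.format(i)] = file['{}sec'.format(i)]
print(df)
```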
Reading multiple CSV files and merge Python Pandas
You can use pd.concat and a list comprehension:
df = pd.concat([pd.read_csv(csv_name, sep=';', header=None) for csv_name in csv_names])
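Here csv_names is just a list of paths (or anything read_csv accepts, e.g. built with glob.glob). A self-contained sketch with in-memory stand-ins, keeping the same sep=';' and header=None:

```python
import io
import pandas as pd

# Stand-ins for the file names; read_csv accepts file paths or file objects
csv_names = [io.StringIO("1;a\n2;b"), io.StringIO("3;c")]

df = pd.concat([pd.read_csv(csv_name, sep=';', header=None) for csv_name in csv_names])
df = df.reset_index(drop=True)  # optional: avoid duplicate row indices
print(df)
```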
How to merge multiple text files into one csv file in Python
The problem is that your os.listdir gives you the list of filenames inside dirpath, not the full paths to these files. You can get the full path by prepending dirpath to each filename with the os.path.join function.
import os
import pandas as pd

# Raw strings avoid having to escape every backslash in Windows paths
dirpath = r'C:\Files\Code\Analysis\Input\qobs_RR1'
output = r'C:\Files\Code\Analysis\output\qobs_CSV.csv'

csvout_lst = []
files = [os.path.join(dirpath, fname) for fname in os.listdir(dirpath)]
for filename in sorted(files):
    data = pd.read_csv(filename, sep=':', index_col=0, header=None)
    csvout_lst.append(data)

pd.concat(csvout_lst).to_csv(output)
Edit: this can be done with a one-liner:

pd.concat(
    pd.read_csv(os.path.join(dirpath, fname), sep=':', index_col=0, header=None)
    for fname in sorted(os.listdir(dirpath))
).to_csv(output)
Edit 2: updated the answer, so the list of files is sorted alphabetically.
Python csv merge multiple files with different columns
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next
rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than the current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after the first line, skip to the next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                continue  # skip each file's header row (bare "next" did nothing)
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in a different order, you need a more complex solution (though not by much; the Python csv.DictReader class could do most of the heavy lifting).
Demo: https://ideone.com/S998l4
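A minimal sketch of that DictReader idea, using hypothetical in-memory files whose columns differ in order and number; DictWriter's restval fills the gaps:

```python
import csv
import io

# Hypothetical files: different column orders, one has an extra column
files = [io.StringIO("a,b\n1,2\n"), io.StringIO("b,c,a\n3,4,5\n")]

# First pass: collect the union of all field names, preserving first-seen order
fields = []
for f in files:
    for name in csv.DictReader(f).fieldnames:
        if name not in fields:
            fields.append(name)
    f.seek(0)

# Second pass: DictWriter matches values to columns by name,
# filling any missing columns with restval
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields, restval='')
writer.writeheader()
for f in files:
    writer.writerows(csv.DictReader(f))
print(out.getvalue())
```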
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN
block to read all the files before starting the main script.