CSV in Python Adding an Extra Carriage Return, on Windows

Python 3:

The official csv documentation recommends opening the file with newline='' on all platforms to disable universal newlines translation:

import csv

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    ...

The CSV writer terminates each line with the lineterminator of the dialect, which is '\r\n' for the default excel dialect on all platforms because that's what RFC 4180 recommends.
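The effect can be reproduced on any platform with a small sketch. The helper name write_rows and the temporary file are illustrative assumptions; passing newline='\r\n' stands in for the translation that text mode applies by default on Windows.

```python
import csv
import os
import tempfile


def write_rows(path, rows, newline):
    """Write rows with csv.writer using the given newline mode; return the raw bytes."""
    with open(path, "w", newline=newline, encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    with open(path, "rb") as f:
        return f.read()


rows = [["a", "b"], ["1", "2"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")
    # newline="" disables translation, so the writer's '\r\n' is stored as-is.
    good = write_rows(path, rows, newline="")
    # newline="\r\n" turns every '\n' into '\r\n', as Windows text mode does by default.
    doubled = write_rows(path, rows, newline="\r\n")

print(good)     # b'a,b\r\n1,2\r\n'
print(doubled)  # b'a,b\r\r\n1,2\r\r\n'
```

The second result contains the '\r\r\n' sequence that shows up as blank rows in spreadsheet programs.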



Python 2:

On Windows, always open your files in binary mode ("rb" or "wb"), before passing them to csv.reader or csv.writer.

Although the file is a text file, CSV is regarded as a binary format by the libraries involved, with \r\n separating records. If that separator is written in text mode, the Python runtime replaces the \n with \r\n, hence the \r\r\n observed in the file.

Python: Reading a Windows generated csv with carriage return in column

You could first strip out the extra carriage return characters using a regular expression and then use a multi-character separator for Pandas.

import pandas as pd
import io
import re
import csv

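with open('e_carriagereturn_20220430.dat', newline='') as f_input:
    # Replace each stray carriage return (one not followed by a line feed) with a space.
    # The lookahead leaves the character after the carriage return intact.
    data = re.sub(r'\r(?!\n)', ' ', f_input.read())

df = pd.read_csv(io.StringIO(data), sep=r'\|-\|', quoting=csv.QUOTE_NONE, engine='python', header=None)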

print(df)
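As a minimal sketch of the clean-up step on its own, the snippet below uses a made-up two-record sample (not the real file) and a lookahead pattern, so only carriage returns that are not part of a '\r\n' line ending are replaced. The records are split with plain string methods here just to keep the example free of dependencies.

```python
import re

# Hypothetical sample: the first record has a stray '\r' inside a column.
sample = "1|-|foo\rbar|-|x\r\n2|-|baz|-|y\r\n"

# Replace a carriage return only when it is NOT followed by a line feed.
cleaned = re.sub(r"\r(?!\n)", " ", sample)

# Split into records, then into columns on the multi-character separator.
records = [line.split("|-|") for line in cleaned.splitlines()]

print(records)  # [['1', 'foo bar', 'x'], ['2', 'baz', 'y']]
```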

Alternatively, without regular expressions, you could pre-parse the data as follows. First, open the file with newline='' to keep the newline characters, so that the carriage returns can then be removed easily. Second, use quoting=csv.QUOTE_NONE to disable quote processing. Lastly, remove any columns that contain just -.

import pandas as pd
import io
import csv

rows = []

with open('e_carriagereturn_20220430.dat', newline='') as f_input:
    # newline='' preserves the carriage returns so they can be stripped explicitly
    data = f_input.read().replace('\x0d', '')

csv_input = csv.reader(io.StringIO(data), delimiter='|', quoting=csv.QUOTE_NONE)

for row in csv_input:
    # Drop the '-' columns left over from the '|-|' separator
    rows.append([value for value in row if value != '-'])

df = pd.DataFrame(rows)
print(df)

Both give output similar to:

            0  1                    2                    3                    4   5        6   7               8  9  10 11        12      13      14       15            16 17 18 19 20 21        22 23                   24             25 26 27      28
0 752296019 " 04/15/2022 00:00:00 04/28/2022 00:00:00 04/15/2022 00:00:00 13 0 0 A J J 0.0000 0.0000 0.0000 1.2300 123456.2700 J J -23.4500 N 04/19/2022 12:00:41 AEINSTEIN1 0.0000
1 752296020 " 03/31/2022 00:00:00 04/13/2022 00:00:00 03/31/2022 00:00:00 1 359542 12 318047.01 A J J 543.2100 0.0000 0.0000 32.1000 244680.4400 J J 543.2100 J 04/01/2022 12:44:42 PKDICK1 0.0000
2 752296032 ! 04/08/2022 00:00:00 04/22/2022 00:00:00 04/08/2022 00:00:00 2 222856 12 54321 A J N 26.8700 0.0000 0.0000 1.2800 38068.8800 J N J N 26.8700 J 04/06/2022 12:00:32 ABC003 0.0000
3 752296044 " 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 2 222857 12 34877 D J N 6.7800 0.0000 0.0000 6.7800 122345.3500 J J 6.7800 J 04/19/2022 12:00:49 WGIBSON 0.0000
4 752296098 ! 04/17/2022 00:00:00 05/01/2022 00:00:00 04/17/2022 00:00:00 13 0 0 D N N 0.0000 0.0000 0.0000 8.7000 79689.4800 N N J N 0.0000 N 04/15/2022 12:24:58 ABC003 0.0000
5 431807560 " 04/12/2022 00:00:00 04/21/2022 00:00:00 04/12/2022 00:00:00 5 0 0 D J N 16.9600 0.0000 0.0000 0.8500 10919.6900 J J 16.7800 N 04/13/2022 14:49:44 FHERBERT 0.0000
6 431807563 ! 04/17/2022 00:00:00 05/01/2022 00:00:00 04/17/2022 00:00:00 11 0 0 D N N 0.0000 0.0000 0.0000 2.6700 31790.1600 N N J N 0.0000 N 04/15/2022 12:44:56 ABC003 0.0000
7 431807594 " 03/28/2022 00:00:00 04/11/2022 00:00:00 03/28/2022 00:00:00 1 580807 12 12345AB12345AB D J J 193.8200 0.0000 0.0000 19.3800 276921.4800 J J 193.8200 J 03/29/2022 12:00:38 WGIBSON 0.0000
8 431807597 " 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 1 107348 12 12.45671/AB D J J 6.7800 0.0000 0.0000 6.7800 87133.8200 J J 6.7800 J 04/15/2022 12:22:35 UKLEGUIN 0.0000
9 679785779 " 03/18/2022 00:00:00 04/01/2022 00:00:00 03/18/2022 00:00:00 13 0 0 B N N 0.0000 0.0000 0.0000 9.3300 142940.7700 N N J N 0.0000 N 04/20/2022 08:04:02 AHUXLEY 0.0000
10 679785789 ! 04/15/2022 00:00:00 04/29/2022 00:00:00 04/15/2022 00:00:00 2 4876321 12 488250/CD D J N 876.5800 0.0000 0.0000 16.7800 200604.8900 J N J N 876.5400 J 04/13/2022 12:28:49 ABC003 0.0000
11 665661904 ! 04/15/2022 00:00:00 04/29/2022 00:00:00 04/15/2022 00:00:00 2 394132 12 46409 EF D J N 567.9800 0.0000 0.0000 9.1600 513561.4600 J N J N 567.8700 J 04/13/2022 12:24:37 ABC003 0.0000
12 665661909 " 03/25/2022 00:00:00 04/01/2022 00:00:00 03/25/2022 00:00:00 14 216308 12 97745894XY D J J 0.0000 0.0000 0.0000 11.4500 208666.1300 J J 0.0000 J 03/25/2022 12:25:03 FHERBERT 0.0000
13 665661934 ! 04/19/2022 00:00:00 05/02/2022 00:00:00 04/19/2022 00:00:00 2 627911 12 abc/21.4177 D J N 54.3200 0.0000 0.0000 23.4500 333689.0000 J N J N 54.3200 J 04/14/2022 23:15:20 ABC003 0.0000
14 665661945 ! 03/25/2022 00:00:00 04/07/2022 00:00:00 03/25/2022 00:00:00 1 3074312 12 923088/ABC D J J 199.2600 0.0000 0.0000 14.5600 850785.1500 J N J N 189.0120 J 03/25/2022 11:48:55 ABC003 0.0000
15 665661965 ! 04/22/2022 00:00:00 05/06/2022 00:00:00 04/22/2022 00:00:00 1 627921 12 27160 D J J 567.3400 0.0000 0.0000 45.6800 2252133.2900 J N J N 567.3400 J 04/20/2022 12:43:09 ABC003 0.0000
16 665661976 ! 04/22/2022 00:00:00 05/06/2022 00:00:00 04/22/2022 00:00:00 2 627942 12 1734793zy D J N 223.4800 0.0000 0.0000 23.4500 416715.9100 J J 234.5600 J 04/21/2022 12:04:19 ABC003 0.0000
17 665661978 ! 04/29/2022 00:00:00 05/13/2022 00:00:00 04/29/2022 00:00:00 2 627998 12 44524 fg D J N 226.3000 0.0000 0.0000 5.3700 162912.0800 J N J N 234.2000 J 04/21/2022 12:12:44 ABC003 0.0000
18 665661987 " 04/07/2022 00:00:00 04/19/2022 00:00:00 04/07/2022 00:00:00 14 0 0 D J J 78.6500 0.0000 0.0000 1.3400 56249.8400 N J 78.6500 N 04/08/2022 12:32:28 PKDICK1 0.0000

csv.writerows() puts newline after each row

This problem occurs only with Python on Windows.

In Python v3, you need to add newline='' in the open call per:

Python 3.3 CSV.Writer writes extra blank rows

In Python v2, you need to open the file in binary mode by adding "b" to the mode in your open() call before passing the file to csv.

Changing the line

with open('stocks2.csv','w') as f:

to:

with open('stocks2.csv','wb') as f:

will fix the problem.

More info about the issue here:

CSV in Python adding an extra carriage return, on Windows

Python: write.csv adding extra carriage return

The default line terminator for csv.writer is '\r\n'. Explicitly specify the lineterminator argument if you want only '\n':

wr = csv.writer(csvFile, delimiter=';', lineterminator='\n')
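A small sketch that compares the two settings by writing into an in-memory buffer. The helper name render is an illustrative assumption; io.StringIO is used because it performs no newline translation on write, so the result shows exactly what the writer emits.

```python
import csv
import io


def render(rows, **writer_kwargs):
    """Return the exact text csv.writer produces for the given rows."""
    buf = io.StringIO()  # in-memory text stream, no newline translation on write
    csv.writer(buf, **writer_kwargs).writerows(rows)
    return buf.getvalue()


rows = [["a", "b"], ["1", "2"]]
default = render(rows, delimiter=";")
custom = render(rows, delimiter=";", lineterminator="\n")

print(repr(default))  # 'a;b\r\n1;2\r\n'
print(repr(custom))   # 'a;b\n1;2\n'
```

When the target is a real file on Windows, it should still be opened with newline='' if the output must contain only '\n'; otherwise text mode translates each '\n' back into '\r\n'.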

need to read CSV with carriage returns as data using Python

Assuming, based on your description, that every row should be 4 fields wide, you could replace all the newlines with commas and then use range to generate the index of every 4th field. That index gives you the parameter name, and the next 3 fields go into a list. The code below is just a quick example of how you could do this. To be cleaner and not have to worry about embedded commas and the like, you could still use csv.reader to parse the data and then iterate over it in the same way.

This solution does assume that you can read the entire file into memory. If you are dealing with very large files, let me know, as a different solution would be needed to read the file line by line.

# Read the entire file into memory (hoping these are not large files :D)
with open("Data.csv") as my_csv_file:
    data = my_csv_file.read()

# Find the end of the first line and split that line on commas so we can work out
# the number of fields per record, as all records will have the same number of fields
index_of_end_of_first_line = data.find("\n")
num_fields = len(data[:index_of_end_of_first_line].split(','))

# Replace all newlines with commas and start an empty dict
data_fields = data.replace("\n", ",").split(',')
data_dict = {}

# Loop over all the fields, picking num_fields fields at a time
for index in range(0, len(data_fields), num_fields):
    data_dict[data_fields[index]] = data_fields[index + 1:index + num_fields]
    print(data_fields[index:index + num_fields])
print(data_dict)

OUTPUT

['Results Table 1', '1', '2', '3']
['Operator', 'name1', 'name2', 'name3']
['Test Date', '2/26/2020', '2/26/2020', '2/26/2020']
['Test Temperature', '70', '70', '70']
['Relative Humidity (%)', '25.00', '25.00', '25.00']
['Test Pressure', 'Ambient', 'Ambient', 'Ambient']
['Comments', '', '', '']
['Failure Location', 'Advancing', 'Advancing', 'Advancing']
['Tensile stress at Maximum Load (ksi)', '47.86', '46.04', '45.49']
['Force at Maximum Load (kip)', '9.20', '8.81', '8.70']
{'Results Table 1': ['1', '2', '3'], 'Operator': ['name1', 'name2', 'name3'], 'Test Date': ['2/26/2020', '2/26/2020', '2/26/2020'], 'Test Temperature': ['70', '70', '70'], 'Relative Humidity (%)': ['25.00', '25.00', '25.00'], 'Test Pressure': ['Ambient', 'Ambient', 'Ambient'], 'Comments': ['', '', ''], 'Failure Location': ['Advancing', 'Advancing', 'Advancing'], 'Tensile stress at Maximum Load (ksi)': ['47.86', '46.04', '45.49'], 'Force at Maximum Load (kip)': ['9.20', '8.81', '8.70']}
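The csv.reader variant mentioned in the answer could look roughly like the sketch below. The function name regroup and the sample text are illustrative assumptions, not the asker's data. It assumes, as above, that the first line is a complete record and that a record is only ever broken between fields.

```python
import csv
import io


def regroup(text):
    """Rebuild fixed-width records from CSV text whose records are split across lines."""
    # csv.reader handles quoted fields, so embedded commas do not split a value.
    rows = [row for row in csv.reader(io.StringIO(text, newline="")) if row]
    if not rows:
        return {}
    num_fields = len(rows[0])  # the first line is assumed to be a complete record
    fields = [value for row in rows for value in row]
    result = {}
    for index in range(0, len(fields), num_fields):
        # The first field is the parameter name, the rest are its values.
        result[fields[index]] = fields[index + 1:index + num_fields]
    return result


# Hypothetical sample: the 'Operator' record is broken across two lines.
sample = 'Results Table 1,1,2,3\nOperator,name1\nname2,name3\n"Comments, quoted",,,\n'
parsed = regroup(sample)
print(parsed)
```

Unlike the plain string replacement, this version skips blank lines and is not affected by a trailing newline at the end of the file.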

