Importing CSV Quoting Error Is Driving Me Nuts

quoting error importing JSON formatted column of geographical object via CSV into rails + postgresql/postgis

Going through CSV here means jumping through pointless hoops. Plain ol' Ruby is much more direct, given the extension involved:

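task :load_geo_data => :environment do
  # Each line is decoded on its own as GeoJSON, so no CSV parsing is involved.
  path = File.absolute_path('uploads/regions.tsv')
  # File.foreach closes the file when the block finishes.
  File.foreach(path) do |line|
    feature = RGeo::GeoJSON.decode(line)
    Regionpolygon.create(
      rawdata: feature.geometry.as_text,
      [...]
    )
  end
end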
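
For comparison, the same line-by-line idea can be sketched in Python. This is only an illustration, assuming each line of the file is one complete GeoJSON Feature; the function name `geometries_from_lines` and the sample data are hypothetical and not part of the original task.

```python
import json

def geometries_from_lines(lines):
    """Yield the geometry dict from each non-empty GeoJSON Feature line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        feature = json.loads(line)  # JSON quotes need no CSV escaping here
        yield feature["geometry"]

# Hypothetical sample: one Feature per line, plus a blank line.
sample = [
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [1.0, 2.0]}, "properties": {}}\n',
    "\n",
]
geometries = list(geometries_from_lines(sample))
print(geometries)
```

Because every line is parsed as JSON directly, the double quotes inside the JSON never pass through a CSV parser.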

Error in Reading a csv file in pandas[CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.]

Not an answer, but too long for a comment (not speaking of code formatting)

As it breaks when you read it with the csv module, you can at least locate the line where the error occurs:

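import csv

# newline='' lets the csv module handle line endings itself (Python 3).
with open(r"C:\work\DATA\Raw_data\store.csv", newline='') as f:
    reader = csv.reader(f)
    linenumber = 1
    try:
        for row in reader:
            linenumber += 1
    except Exception as e:
        # linenumber is the 1-based index of the row that failed to parse.
        print("Error line %d: %s %s" % (linenumber, type(e), e))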

Then look in store.csv what happens at that line.
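
One caveat: Python's csv module is lenient by default and may read a malformed file without raising anything. Here is a minimal sketch, assuming the problem is a quoting error: `strict=True` makes the reader raise `csv.Error` on bad input, and `reader.line_num` reports the physical line it had reached. The helper name `first_bad_line` and the sample text are made up for illustration.

```python
import csv
import io

def first_bad_line(stream):
    """Return (line_number, message) for the first malformed line, or None."""
    reader = csv.reader(stream, strict=True)  # strict: raise on bad quoting
    try:
        for _ in reader:
            pass
    except csv.Error as exc:
        # line_num counts physical lines read from the source so far.
        return reader.line_num, str(exc)
    return None

# Hypothetical data: line 3 has a stray character after a closing quote.
sample = 'id,name\n1,"ok"\n2,"bad"x\n3,fine\n'
result = first_bad_line(io.StringIO(sample))
print(result)
```

For a real file, pass the object returned by `open(path, newline='')` instead of the `io.StringIO` sample.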

Parsing csv file in Ruby

That message is technically correct. Quotes have special meaning for the CSV format - they would allow you to embed separator characters in the data. Any quotes used within a field therefore need to be escaped if they are part of the data, or the CSV parser should be informed to use some other character for quoting, in which case it will treat any " that it sees as literal data.

If you don't need to support pipes actually within each field, and have some other unused character you can shift this problem off to, Ruby's CSV can be made to consume your (slightly) malformed csv format:

CSV.parse(data, col_sep: '|', quote_char: '%')

Otherwise, the correct quoting for your problem line is

|"Some ""quoted name"""|2|12|Machine|
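
The same two options exist in Python's csv module, shown here as a rough sketch rather than a translation of the Ruby call. The malformed input line is an assumption (the correctly quoted line above with its quoting removed), and `csv.QUOTE_NONE` plays the role of "treat quotes as literal data".

```python
import csv
import io

# Option 1: the correctly quoted line parses with the standard quote rules.
correct_line = '|"Some ""quoted name"""|2|12|Machine|'
parsed = next(csv.reader([correct_line], delimiter="|"))
print(parsed)

# Option 2: tell the parser that quotes are not special at all.
raw_line = '|Some "quoted name"|2|12|Machine|'  # assumed malformed input
literal = next(csv.reader([raw_line], delimiter="|", quoting=csv.QUOTE_NONE))
print(literal)

# A CSV writer produces the doubled quotes for you.
buffer = io.StringIO()
csv.writer(buffer, delimiter="|").writerow(['Some "quoted name"', 2, 12, "Machine"])
written = buffer.getvalue()
print(written)
```

The leading and trailing pipes produce an empty first and last field in both parses, which is worth remembering when indexing columns.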

Importing CSV-file as list returns empty list

This is because a variable declared inside a function is only available inside that function and not in the global scope. In this case my suggestion would be to return the data from the function after you have finished reading the file. In other words something like this:

import csv

def read_file():
    with open('filepath', 'r') as file:
        csv_reader = csv.reader(file, delimiter=';')
        data = []
        for row in csv_reader:
            data.append(row)

    return data

file_data = read_file()
print(file_data)

It is also possible to make the data variable inside the function global. However, this is normally not recommended, because global variables quickly become hard to keep track of. It is much easier to see where data comes from when it is returned from a function like this.

Line breaks in generated csv file driving me crazy

This works for me:

a) Setting Response.ContentEncoding = System.Text.Encoding.UTF8 isn't enough to make Excel open UTF-8 files correctly. Instead, you have to manually write a byte-order mark (BOM) at the start of the file:

if (UseExcel2003Compatibility)
{
    // write UTF-16 BOM, even though we export as utf-8. Wrong but *I think* the only thing Excel 2003 understands
    response.Write('\uFEFF');
}
else
{
    // use the correct UTF-8 bom. Works in Excel 2008 and should be compatible to all other editors
    // capable of reading UTF-8 files
    byte[] bom = new byte[3];
    bom[0] = 0xEF;
    bom[1] = 0xBB;
    bom[2] = 0xBF;
    response.BinaryWrite(bom);
}

b) send as octet-stream, use a filename with .csv extension and do quote the filename as is required by the HTTP spec:

response.ContentType = "application/octet-stream";
response.AppendHeader("Content-Disposition", "attachment; filename=\"" + fileName + "\"");

c) use double quotes for all fields

I just checked and for me Excel opens downloaded files like this correctly, including fields with line breaks.

But note that Excel still won't open such CSV correctly on all systems that have a default separator different to ",". E.g. if a user is running Excel on a Windows system set to German regional settings, Excel will not open the file correctly, because it expects a semicolon instead of a comma as separator. I don't think there is anything that can be done about that.
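
Steps a) and c) above can be sketched in Python for readers who generate the file outside ASP.NET. This is a minimal illustration, not the original code: the function name `build_excel_csv` and the sample rows are hypothetical. The `utf-8-sig` codec prepends the UTF-8 BOM, and `csv.QUOTE_ALL` puts double quotes around every field.

```python
import csv
import io

def build_excel_csv(rows):
    """Return CSV bytes with a UTF-8 BOM and every field double-quoted."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    # "utf-8-sig" writes the bytes EF BB BF before the encoded text.
    return buffer.getvalue().encode("utf-8-sig")

# Hypothetical sample with a line break inside a field.
payload = build_excel_csv([["id", "note"], [1, "line one\nline two"]])
print(payload)
```

The line break stays inside the quoted field, while each record ends with the writer's default `\r\n` terminator.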

Ruby CSV fails on fields like =1234

Using gsub should be enough:

#!/usr/bin/env ruby

require 'csv'

data = File.read('file.csv').gsub(/=("[^"]*")/, '\\1')

CSV.parse(data).each do |e|
  puts e.inspect
end

Output:

["Product Code", "Product Name", "Retail Price", "Tax Percentage", "Option Name", "Option Type"]
["20042", "Blossom Wall Art", "245.00", "1", "", ""]

Importing CSV with line breaks in Excel 2007

I have finally found the problem!

It turns out that we were writing the file using Unicode encoding, rather than ASCII or UTF-8. Changing the encoding on the FileStream seems to solve the problem.

Thanks everyone for all your suggestions!
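
A quick way to check for this kind of mismatch is to look at the first bytes of the generated file. The sketch below assumes "Unicode" here means UTF-16, which is what System.Text.Encoding.Unicode produces in .NET; the helper name `detect_bom` is made up for illustration, and it only recognises files that actually start with a BOM.

```python
import codecs

def detect_bom(data):
    """Return a codec name based on the leading BOM bytes, or None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return None  # no BOM: could be ASCII, plain UTF-8, or something else

# Hypothetical sample: the same text written with two encodings.
utf16_bytes = "a,b\r\n".encode("utf-16")
utf8_bytes = "a,b\r\n".encode("utf-8")
print(detect_bom(utf16_bytes), detect_bom(utf8_bytes))
```

In practice you would read the first few bytes of the file in binary mode and pass them to the helper.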