Ruby: How to Process a CSV File with "Bad Commas"

Ruby: How can I process a CSV file with bad commas?

Well, here's an idea: You could replace each instance of comma-followed-by-a-space with a unique character, then parse the CSV as usual, then go through the resulting rows and reverse the replace.
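A minimal sketch of that idea, assuming real delimiters are never followed by a space and that the placeholder character never appears in the data. The names PLACEHOLDER and parse_with_embedded_commas are made up for illustration:

```ruby
require 'csv'

# Hypothetical placeholder; any character that cannot occur in the data will do.
PLACEHOLDER = "\u0001"

def parse_with_embedded_commas(line)
  # Hide every comma-followed-by-a-space so the parser does not split on it.
  protected_line = line.gsub(', ', PLACEHOLDER)
  # Parse as usual, then restore the hidden commas in each field.
  CSV.parse_line(protected_line).map { |field| field&.gsub(PLACEHOLDER, ', ') }
end

row = parse_with_embedded_commas("1,Smith, John,Engineer\n")
puts row.inspect
```

If the file also uses comma-plus-space as a genuine delimiter anywhere, this approach will merge those columns, so it only suits data where that assumption holds.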

How to Parse with Commas in CSV file in Ruby

The illegal quoting error is when a line has quotes, but they don't wrap the entire column, so for instance if you had a CSV that looks like:

NCB 14591  BLK 13  LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR
NCB 14592 BLK 14 LOT W IRR,84.07 FT OF "25",TWENTY-FOUR SAC HOLDING COR

You could parse each line individually and change the quote character only for the lines that use bad quoting:

require 'csv'

def parse_file(file_name)
  File.foreach(file_name) do |line|
    parse_line(line) do |x|
      puts x.inspect
    end
  end
end

def parse_line(line)
  options = { encoding: 'iso-8859-1:utf-8' }
  begin
    yield CSV.parse_line(line, **options)
  rescue CSV::MalformedCSVError
    # give up if the fallback quote character has already been tried
    raise if options[:quote_char]
    # this line is misusing quotes, change the quote character and try again
    options.merge! quote_char: "\x00"

    retry
  end
end

parse_file('./File.csv')

and running this gives you:

["NCB 14591  BLK 13  LOT W IRR", " 84.07 FT OF 25, ALL OF 26,", "TWENTY-THREE SAC HOLDING COR"]
["NCB 14592 BLK 14 LOT W IRR", "84.07 FT OF \"25\"", "TWENTY-FOUR SAC HOLDING COR"]

but then if you have a mix of bad quoting and good quoting in a single row this falls apart again. Ideally you just want to clean up the CSV to be valid.

Rails Rake Task How to parse CSV with commas in fields

From what I can see, you are not passing any options to the parser. When no row_sep or other option is given, smarter_csv uses the system newline separator, which is "\r\n" on Windows machines and "\n" on Unix machines.

That being said, try the following...

require 'smarter_csv'
SmarterCSV.process('input_file.csv', { :row_sep => :auto, :col_sep => "," }) do |chunk|
  chunk.each do |data_hash|
    Moulding.create!( data_hash )
  end
end

I agree with Swards. What I have done assumes quite a lot of things. A glance at some CSV data could be useful.

parse csv with commas, double quotes and encoding

I think you do have MacRoman encoded data; if you do this in irb:

>> "\x97".force_encoding('MacRoman').encode('UTF-8')

you get this:

=> "ó"

And that seems to be the character that you're expecting. So you want this:

input_string = File.read("../csv_parse.rb").force_encoding('MacRoman').encode('UTF-8')

Then you have two columns in your CSV, the columns are quoted with double quotes (so you don't need :quote_char), and the delimiter is ', ' so this should work:

data = CSV.parse(input_string, :col_sep => ", ")

and data will look like this:

[
["Name", "main-dialogue"],
["Marceu", "Give it to him ó he, his wife."]
]

Escape Comma from CSV in Ruby

Use Ruby's built-in to_csv method.

If you haven't already done so, you'll need to require 'csv'.

Sell Date, Sell Amount
- @rows.each do |row|
  = [ row[0], number_to_currency(row[1], :precision => 2) ].to_csv( row_sep: nil ).html_safe

to_csv is available right on the Array and does all the escaping you'd expect it to do.

row_sep: nil prevents the \n at the end of each row since you're already doing that with each. Try it without that and you'll see that you get an extra blank line. If you were just generating a single CSV string then you'd need to keep the \n to separate the rows.

html_safe stops the view from HTML-escaping the " characters, so they don't show up as &quot; in your CSV file.

That should do it!

JP
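Outside of a view, the escaping can be checked in plain Ruby. This is a small sketch with made-up sample values; it uses the default row separator, so each result ends in "\n":

```ruby
require 'csv'

# A field containing the delimiter gets wrapped in double quotes.
with_comma = ["2014-01-05", "$1,234.50"].to_csv

# A field containing a double quote is wrapped, and the inner quotes are doubled.
with_quote = ['say "hi"', "plain"].to_csv

puts with_comma
puts with_quote
```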

Ruby CSV not reading comma-formatted numbers in quoted strings

You can add a custom converter that handles your number columns correctly. Not sure if this covers all of your possible formatting options, but it'd look something like this:

Create a lambda:

comma_numbers = ->(s) {(s =~ /^\d+,/) ? (s.gsub(',','').to_f) : s}

Add it to your converters:

CSV::Converters[:comma_numbers] = comma_numbers

The new converter is not included in converters: :all so add it as an array:

converters: [:all, :comma_numbers]
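Put together, the three steps might look like the sketch below. The sample data is invented for illustration, and the converter only covers values that start with digits followed by a comma:

```ruby
require 'csv'

# Strip thousands separators from values such as "1,234.50" and convert to Float.
comma_numbers = ->(s) { (s =~ /^\d+,/) ? s.gsub(',', '').to_f : s }

# Register the lambda under a name so it can be referenced by symbol.
CSV::Converters[:comma_numbers] = comma_numbers

sample = "widget,\"1,234.50\"\ngadget,42\n"
rows = CSV.parse(sample, converters: [:all, :comma_numbers])
puts rows.inspect
```

Because :all runs first, plain numbers such as 42 are already converted by the built-in converters, and only fields that are still strings reach the custom one.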

Ruby : sanitize CSV with irregular fields

Well, I ended up with :

processed = File.readlines(path).map do |row|
  row.strip.gsub('""', '"')[1..-2]
end.join("\n")
CSV.parse(processed)

The [1..-2] just removes the extra " at the beginning/end of the line that was messing up things
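To see what the cleanup does, here is the same transformation on in-memory lines. The sample rows are an assumption about the shape of the broken input: each whole row is wrapped in quotes and the inner quotes are doubled.

```ruby
require 'csv'

# Assumed shape of the irregular input: the entire row is quoted as one field.
raw_lines = [
  %("1,""Smith, John"",42"\n),
  %("2,""Doe, Jane"",7"\n)
]

processed = raw_lines.map do |row|
  # Collapse doubled quotes, then drop the outer quote at each end of the row.
  row.strip.gsub('""', '"')[1..-2]
end.join("\n")

rows = CSV.parse(processed)
puts rows.inspect
```

Note that the gsub would also collapse a legitimately escaped quote inside a field, so this only suits files where every row is broken in the same way.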

What's a semantically-correct way to parse CSV from SQL Server 2008?

The following uses regexp and String#scan. I observe that in the broken CSV format you're dealing with, that " only has quoting properties when it comes at the beginning and end of a field.

Scan moves through the string, successively matching the regexp, so the regexp can assume its start match point is the beginning of a field. We construct the regexp so it can match either a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). Whichever alternative matches, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

Because UNQUOTED can match a zero length field before a separator, the scan always matches an empty field at the end which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element picking the captured alternate with matches[0] || matches[1].

None of your example lines show a field which contains both a comma and a quote. I have no idea how it would be legally represented, and this code probably won't recognize such a field correctly.

SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/

FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def ugly_parse line
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end

# lines is assumed to be an array holding the raw CSV lines
lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end

# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
#
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
#
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]

Writing to csv is adding quotes

If you get rid of the quotes then your output is no longer CSV. The CSV class can be instructed to use a different delimiter and will only quote if that delimiter is included in the input. For example:

require 'csv'
output = "This is a, ruby output"
File.open("output/abc.csv", "a+") do |io|
  csv = CSV.new(io, col_sep: '^')
  csv << [output, "the end"]
end

Output:

This is a, ruby output^the end
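The quoting rule can be checked without touching the filesystem. This sketch uses CSV.generate to build the output in memory; the second row is an invented example of a field that contains the custom delimiter:

```ruby
require 'csv'

generated = CSV.generate(col_sep: '^') do |csv|
  # A comma is no longer special, so this field is written unquoted.
  csv << ["This is a, ruby output", "the end"]
  # This field contains the delimiter itself, so it gets quoted.
  csv << ["contains ^ caret", "the end"]
end

puts generated
```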

