Ruby CSV Parsing String with Escaped Quotes

CSV supports "converters", which we can normally use to massage the content of a field before it's passed back to our code. For instance, that can be used to strip extra spaces on all fields in a row.
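As a minimal sketch of that idea, here is a converter that strips surrounding spaces from every field. The sample line and the `strip_spaces` lambda are hypothetical and only serve as an illustration:

```ruby
require 'csv'

# Hypothetical sample line with extra spaces around each field.
line = " 173 , Matz , Japan \n"

# A converter receives each field after the line has been split.
# The safe-navigation call guards against nil (empty) fields.
strip_spaces = ->(field) { field&.strip }

row = CSV.parse_line(line, converters: [strip_spaces])
p row  # => ["173", "Matz", "Japan"]
```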

Unfortunately, converters run only after the line has been split into fields, and it is during that split that CSV raises an error about the embedded quotes. So we have to step in between the "read the line" step and the "parse the line into fields" step.
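A small sketch of that ordering problem, using the data row from the sample file as an inline string. The pass-through converter is only there to show that it never gets a chance to help:

```ruby
require 'csv'

# The data row from the sample file, with backslash-escaped quotes.
line = %q(173,"Yukihiro \"The Ruby Guy\" Matsumoto","Japan")

# The split into fields fails before any converter is called.
begin
  CSV.parse_line(line, converters: [->(field) { field }])
  parse_failed = false
rescue CSV::MalformedCSVError => e
  parse_failed = true
  puts "CSV rejected the raw line: #{e.message}"
end

# Rewriting \" as "" before parsing yields valid CSV.
fixed = line.gsub('\"', '""')
p CSV.parse_line(fixed)
```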

This is my sample CSV file:

ID,Name,Country
173,"Yukihiro \"The Ruby Guy\" Matsumoto","Japan"

Keeping the line-by-line structure of your CSV.foreach loop, this is my example code for parsing it without CSV raising an error:

require 'csv'
require 'pp'

header = []
File.foreach('test.csv') do |csv_line|
  # Turn backslash-escaped quotes (\") into CSV-style doubled quotes ("")
  row = CSV.parse(csv_line.gsub('\"', '""')).first

  # The first line holds the column names
  if header.empty?
    header = row.map(&:to_sym)
    next
  end

  row = Hash[header.zip(row)]
  pp row
  puts row[:Name]
end

And the resulting hash and name value:

{:ID=>"173", :Name=>"Yukihiro \"The Ruby Guy\" Matsumoto", :Country=>"Japan"}
Yukihiro "The Ruby Guy" Matsumoto

I assumed you were wanting a hash back because you specified the :headers flag:

CSV.foreach('my.csv', headers: true, header_converters: :symbol) do |row|
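If reading the whole file into memory is acceptable, the same substitution can be combined with those options directly. This is only a sketch: it assumes every \" in the data really is an escaped quote, and the inline string stands in for the file so that it runs on its own:

```ruby
require 'csv'

# Inline copy of the sample file so the sketch runs without test.csv.
raw = <<~'CSV_TEXT'
  ID,Name,Country
  173,"Yukihiro \"The Ruby Guy\" Matsumoto","Japan"
CSV_TEXT

# Convert backslash-escaped quotes to doubled quotes, then let CSV
# handle the header row itself.
table = CSV.parse(raw.gsub('\"', '""'), headers: true, header_converters: :symbol)

table.each do |row|
  p row.to_h
  puts row[:name]  # the :symbol header converter downcases the names
end
```

Note that with header_converters: :symbol the keys come back as :id, :name and :country rather than :ID, :Name and :Country.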

Ruby: CSV parser tripping over double quotation in my data

The source CSV is malformed; the quotes should have been escaped before the file was written.

I would edit the file before parsing it with CSV: remove the quotes between commas, and replace the remaining double quotes with single quotes. You can write the result to a new file in case you don't want to edit the original.

def fix_csv(file)
  out = File.open("fixed_" + file, 'w')
  File.readlines(file).each do |line|
    line = line[1...-2]      # remove the beginning and end quotes (and the newline)
    line.gsub!(/","/, ",")   # remove all quotes between commas
    line.gsub!(/"/, "'")     # replace double quotes with single quotes
    out << line + "\n"       # add the line plus a newline to the output
  end

  out.close
  return "fixed_" + file
end
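To see what the three steps inside the loop do, here is a trace on a single line. The input is hypothetical and assumes the shape this function expects: every field quoted, and each line ending in a newline:

```ruby
# Hypothetical input line: all fields quoted, with a stray
# unescaped quote pair inside the second field.
line = %Q("a","b "x" c","d"\n)

step1 = line[1...-2]            # drop the first quote, the last quote and the newline
step2 = step1.gsub(/","/, ",")  # drop the quotes around each separator
step3 = step2.gsub(/"/, "'")    # turn remaining double quotes into single quotes

puts step1  # a","b "x" c","d
puts step2  # a,b "x" c,d
puts step3  # a,b 'x' c,d
```

Once the surrounding quotes are gone, a comma inside a field can no longer be told apart from a separator, so this only suits data without embedded commas.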

In case you want to modify the same CSV file, you can do it this way:

require 'tempfile'
require 'fileutils'

def modify_csv(file)
  temp_file = Tempfile.new('temp')
  begin
    File.readlines(file).each do |line|
      line = line[1...-2]
      line.gsub!(/","/, ",")
      line.gsub!(/"/, "'")
      temp_file << line + "\n"
    end
    temp_file.close
    FileUtils.mv(temp_file.path, file)
  ensure
    temp_file.close
    temp_file.unlink
  end
end

This is explained in the linked answer, in case you want to take a look. Either way, it will fix (sanitize) your original CSV file.

parse csv with commas, double quotes and encoding

I think you do have MacRoman encoded data; if you do this in irb:

>> "\x97".force_encoding('MacRoman').encode('UTF-8')

you get this:

=> "ó"

And that seems to be the character that you're expecting. So you want this:

input_string = File.read("../csv_parse.rb").force_encoding('MacRoman').encode('UTF-8')

Then you have two columns in your CSV, the columns are quoted with double quotes (so you don't need :quote_char), and the delimiter is ', ' so this should work:

data = CSV.parse(input_string, :col_sep => ", ")

and data will look like this:

[
["Name", "main-dialogue"],
["Marceu", "Give it to him ó he, his wife."]
]
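The two steps can be checked end to end with a short sketch. The temporary file and its contents are hypothetical stand-ins for the real input, built so that the byte 0x97 appears where the "ó" is expected:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical stand-in for the real input: byte 0x97 is "ó" in MacRoman.
raw_bytes = %Q("Name", "main-dialogue"\n"Marceu", "Give it to him \x97 he, his wife."\n).b

data = Tempfile.create('macroman_sample') do |file|
  file.binmode
  file.write(raw_bytes)
  file.flush

  # Read the raw bytes, label them as MacRoman, then transcode to UTF-8.
  input_string = File.binread(file.path).force_encoding('MacRoman').encode('UTF-8')

  # Fields are double-quoted; the separator is a comma followed by a space.
  CSV.parse(input_string, col_sep: ", ")
end

p data
```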

ruby: read cvs with quotes given a column name

Your issue is occurring because you have "cheated" in how you are parsing the escaped quotation marks in the CSV file, by use of :quote_char => "'". The quote character is not ', it is still just the default of "!

Without fixing this mistake, you could access each column by its header, by including the extra quotation mark that you've inadvertently included:

CSV.foreach(a, :headers => true, :quote_char => "'") do |row|
  row['"title"']
end

However, as outlined in this post, a better solution is to first convert the data into a valid ruby CSV structure and then access it as normal - i.e. something like:

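# Replace backslash-escaped quotes (\") with CSV-style doubled quotes ("")
text = File.read('your-input-file.csv').gsub(/\\"/,'""')
CSV.parse(text, headers: true) do |row|
  b.push(row['title'])  # b is the array from your own code
end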
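The difference between the two approaches can be seen side by side in a sketch. The sample text is hypothetical (the original input file was not shown) and assumes quoted fields with \" inside a title:

```ruby
require 'csv'

# Hypothetical sample: quoted fields, with \" inside one title.
text = <<~'CSV_TEXT'
  "title","year"
  "The \"Best\" Film","1999"
CSV_TEXT

# With quote_char: "'" the double quotes are treated as ordinary data,
# so they end up inside the header names (and the values).
loose = CSV.parse(text, headers: true, quote_char: "'")
p loose.headers

# Converting \" to "" first lets the default quote character do its job.
fixed = text.gsub(/\\"/, '""')
titles = CSV.parse(fixed, headers: true).map { |row| row['title'] }
p titles
```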

How to escape both " and ' when importing each row

Note that you have additional options available to configure the CSV handler. The useful options for specifying character delimiter handling are these:

  • :col_sep - defines the column separator character
  • :row_sep - defines the row separator character
  • :quote_char - defines the character used to quote fields

Now, for traditional CSV (comma-separated) files, these values default to { col_sep: ",", row_sep: :auto, quote_char: "\"" }, where :auto means the row separator is detected from the line endings found in the data. These defaults will satisfy many needs, but not necessarily all. You can specify whichever set suits your well-formed CSV.
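For instance, here is a minimal sketch that sets all three options at once. The input is hypothetical: semicolon-separated, single-quoted, with CRLF line endings:

```ruby
require 'csv'

# Hypothetical input: semicolon-separated, single-quoted, CRLF line endings.
input = "id;name\r\n1;'Smith; John'\r\n"

rows = CSV.parse(input, col_sep: ";", row_sep: "\r\n", quote_char: "'")
p rows  # => [["id", "name"], ["1", "Smith; John"]]
```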

However, for non-standard CSV input, consider using a two-pass approach to reading your CSV files. I've done a lot of work with CSV files from Real Estate MLS systems, and they're basically all broken in some fundamental way. I've used various pre- and post-processing approaches to fixing the issues, and had quite a lot of success with files that were failing to process with default options.

In the case of handling single quotes as a delimiter, you could possibly strip off leading and trailing single quotes after you've parsed the file using the standard double quotes. Iterating on the values and using a gsub replacement may work just fine if the single quotes were used in the same way as double quotes.

There's also an "automatic" converter mechanism that the CSV parser will use when retrieving values for individual columns. You can specify the :converters option, like so: { converters: [:my_converter] }

Writing a converter is pretty simple: it's just a small function that checks whether the column value matches the right format and, if so, returns the reformatted value. Here's one that should strip leading and trailing single quotes:

CSV::Converters[:strip_surrounding_single_quotes] = lambda do |field|
  return nil if field.nil?

  # String#match returns a MatchData object, or nil when there is no match
  match = field.match(/^'([^']*)'$/)
  return match.nil? ? field : match[1]
end

CSV.parse(input, converters: [:strip_surrounding_single_quotes])

You can use as many converters as you like, and they're evaluated in the order that you specify. For instance, to use the pre-defined :all along with the custom converter, you can write it like so:

CSV.parse(input, converters: [:all, :strip_surrounding_single_quotes])
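The order can change the result. In this sketch the input line is hypothetical, and the converter is registered again (in a slightly more defensive form) so the example runs on its own:

```ruby
require 'csv'

# Register a single-quote stripper so this sketch is self-contained.
CSV::Converters[:strip_surrounding_single_quotes] = lambda do |field|
  match = field.is_a?(String) ? field.match(/\A'([^']*)'\z/) : nil
  match ? match[1] : field
end

# Hypothetical input: two single-quoted fields and one bare number.
input = "'abc',42,'7'\n"

quotes_last  = CSV.parse(input, converters: [:all, :strip_surrounding_single_quotes])
quotes_first = CSV.parse(input, converters: [:strip_surrounding_single_quotes, :all])

p quotes_last   # => [["abc", 42, "7"]]
p quotes_first  # => [["abc", 42, 7]]
```

When :all runs first, '7' still carries its quotes and cannot be read as a number, so it stays a string after the quotes are stripped. Stripping first lets :all turn it into an integer.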

If there's an example of the input data to test against, we can probably find a complete solution.


